Maverick Chips for the Next Silicon Generation

  Views: 27,081

TechTechPotato

Days ago

Comments: 171
@greggleswong 29 days ago
"19.5, 67.0, 45.0" "That's numberwang!"
@TechTechPotato 28 days ago
Aha haha best comment. Pinned
@bean_TM 29 days ago
Seems incredibly cool. But I'd need proof it actually works first
@Pavlobot5 29 days ago
Sounds like "trust me bro" performance
@fnorgen 29 days ago
With some of these modern efficiency optimized server CPUs being crammed full of those little e-cores, I don't really see the value proposition for this thing. They claim they'll have some super advanced branch prediction system that will drastically improve throughput, but that seems like a very tough problem to solve. And if there's enough money in HPC, it wouldn't surprise me to see Nvidia splitting their high end GPUs into low precision optimized and high precision optimized product lines. They have the budget for it.
@lalishansh 26 days ago
I remember Jim Keller saying that CPUs are mostly about branch prediction nowadays; this seems like branch prediction on steroids. EXCITING!!
@ProjectPhysX 29 days ago
This sounds fancy but in practice is unlikely to work. HPC codes need dumb vector processors with FP64 vector compute throughput and the memory bandwidth and capacity to back it up. They don't need fancy dynamic branch prediction. HPC codes don't use branching for the most part, so there is nothing that smart branch prediction can even optimize. The telemetry collection for this branch prediction will probably even slow it down. If the thing doesn't support OpenCL/SYCL and instead needs recompiling, it is basically DOA. Recompiling for special hardware never goes smoothly; there is always some detail that doesn't work and needs extra debugging, and developers have neither the time nor the money to adjust their codes for another proprietary chip that does things differently from the industry standard. See Intel Knights Corner and how that worked out...
@henriksundt7148 29 days ago
You are right if it is only applied to standard, massively parallel tasks, like training the weights and biases of a static feed-forward structure / neural net. However, these architectures are to a large extent popular because of the availability of this kind of uniform hardware. There are so many a) tasks that today are performed on the CPU but could be faster (there are examples in the video's description), and b) approaches that are not popular because they are slow. In these domains, NextSilicon can have an impact.
@foobarf8766 29 days ago
HPC does use branching in scientific applications; it is limited by GPUs being vector-only, and IBM Power is still around for good reason. But you are right about the work in porting -- and that also goes against GPUs -- when those applications don't lend themselves to floating point; there are surely a few in meteorology and climate science.
@ABaumstumpf 28 days ago
And even with the branches: In decent code you are already reaching the hardware-limits: Either you fully saturate the ALUs or the memory bandwidth. (or in bad cases like we have had today - networking... 250 workers can put some load on the system).
@TheCebulon 29 days ago
There is one important number missing! And it means a lot to me: 42.
@kehoste 6 days ago
@TechTechPotato I can't wait for a more detailed deep-dive on this. The stuff they had on display at their booth was by far the most impressive thing I saw on the Supercomputing'24 exhibit floor (though I definitely missed a bunch of things, no doubt).
@bernadettetreual 29 days ago
I find it hard to believe that this is true. It's like the theoretical advantages of the Java VM over AoT-compiled code. They never materialized.
@MrHaggyy 29 days ago
This looks really interesting in combination with Mojo. The big problem in HPC is getting the right ALU in the right configuration. I'm curious how they get that much performance out of branch prediction. In the optimization problems I know, we had to explore a search space until we hit a sufficiently low error, for example. The branches were: run this stuff, and occasionally check if it's sufficient. From my understanding, utilizing the ALU correctly by compiling certain functions like matrix multiply with the hardware in mind would be much more effective, as it cuts down the serial path you are parallelizing. The combination of AI and HPC will be an interesting market. An engineer can explore a vastly bigger search space if you train AI to do textbook engineering and give it the right amount of HPC. This kind of workflow is already done in fluid dynamics for vehicles and buildings and has been used to optimize combustion in ICEs. I'm also pretty sure turbine manufacturers use it as well. IBM tried to innovate that space with variable precision. The idea was to brute force the search space in low resolution, sort the results, and recompute the areas of interest with higher precision. But I don't think it got adopted that well. Probably just too complex to handle.
@foobarf8766 29 days ago
The SIMD instructions on my general purpose (AMD) CPU give good baby-step giant-step performance, but I haven't compared with the latest Power ISA to really say. Stuff like that which is heavy on the branching doesn't lend itself well to GPUs. (edit: %s/prediction/branching)
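For readers unfamiliar with the algorithm mentioned above, here is a minimal Python sketch of baby-step giant-step for the discrete logarithm. It is not the commenter's code: the function name `bsgs` and its parameters are illustrative, and it assumes a prime modulus `p` with `g` not a multiple of `p`. The table lookup inside the second loop is the kind of data-dependent branch that maps poorly onto lock-step GPU lanes.

```python
from math import isqrt

def bsgs(g, h, p):
    """Return some x with pow(g, x, p) == h % p, or None if none is found.

    Assumes p is prime and g is not a multiple of p (illustrative sketch).
    """
    m = isqrt(p - 1) + 1

    # Baby steps: remember the smallest j for every value g^j mod p, j in [0, m).
    table = {}
    e = 1
    for j in range(m):
        table.setdefault(e, j)
        e = (e * g) % p

    # Giant steps: walk h * g^(-m*i) and look each value up in the table.
    factor = pow(g, -m, p)  # modular inverse of g^m (Python 3.8+)
    gamma = h % p
    for i in range(m):
        if gamma in table:  # data-dependent branch and memory lookup
            return i * m + table[gamma]
        gamma = (gamma * factor) % p
    return None
```

The work is dominated by hash lookups and an early exit, not by floating-point throughput, which is consistent with the observation that a general-purpose CPU handles it well.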
@MrHaggyy 24 days ago
Well, it depends on how you branch. I played around with fixed-step Euler and Runge-Kutta on my GPU, and if you run the same instructions on a large enough dataset, they fit really well. The same is true for back-propagation-based algorithms. But it gets tricky when you have something like ODE45, which has variable step size and conditional branching. On those, a few dozen cores that combine the benefits of different initial conditions and variable step size/branching would be the best.
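The contrast drawn above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `euler_fixed` and `euler_adaptive` are hypothetical names, and the adaptive variant uses simple step doubling with Euler's method, not the Dormand-Prince scheme that ODE45 actually implements. The point is the control flow: the fixed-step loop runs identical instructions every iteration, while the adaptive loop accepts or rejects each step depending on the data.

```python
import math

def euler_fixed(f, t0, y0, t1, n):
    """Fixed-step Euler: n identical iterations, no data-dependent branch."""
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y += h * f(t, y)
        t += h
    return y

def euler_adaptive(f, t0, y0, t1, tol=1e-6, h=0.1):
    """Adaptive Euler via step doubling; returns (y, accepted, rejected)."""
    t, y = t0, y0
    accepted = rejected = 0
    while t1 - t > 1e-12:
        h = min(h, t1 - t)
        full = y + h * f(t, y)                       # one step of size h
        half = y + 0.5 * h * f(t, y)                 # two steps of size h/2
        two_half = half + 0.5 * h * f(t + 0.5 * h, half)
        err = abs(two_half - full)                   # local error estimate
        if err <= tol:       # accept: advance and try a larger step next
            t += h
            y = two_half
            accepted += 1
            h *= 1.5
        else:                # reject: retry the same point with a smaller step
            rejected += 1
            h *= 0.5
    return y, accepted, rejected

if __name__ == "__main__":
    decay = lambda t, y: -y  # y' = -y, exact solution exp(-t)
    print(euler_fixed(decay, 0.0, 1.0, 1.0, 1000), math.exp(-1.0))
    print(euler_adaptive(decay, 0.0, 1.0, 1.0))
```

If many initial conditions were integrated side by side, each would accept and reject steps at different times, which is where lock-step vector hardware starts to lose efficiency.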
@autohmae 29 days ago
HPC is not a small market, if they can get a good chunk, it's a good niche
@nextlifeonearth 29 days ago
So it's basically a super wide out of order execution pipeline with speculative execution on steroids. So instead of like 4 fpus per core they have, say, 256 and a massive execution buffer (I expect the hbm is for that) to keep them fed. Their isa is probably defined for ooo, which is why the recompile is enough. The branch predictor simply learns and works farther ahead than any current cpu. Or that's what I'm getting from this.
@elad_raz 27 days ago
No ISA, it's a dataflow
@soonts 29 days ago
GPUs are only “fixed” within one dispatch / one draw call. However, many practical compute problems/3D scenes are split into thousands if not millions of dispatches / draw calls, and code running on the CPU can adjust the size of dispatches / length of draw calls at runtime. BTW, on Windows it is critical to control the count of in-flight compute thread groups / draw calls, because the OS insists GPUs should stay responsive at all times even when loaded, and has the timeout detection and recovery (TDR) feature in the OS kernel to enforce the policy. To be fair, until work graphs arrived in D3D12, said flexibility was tricky to implement. D3D11 supports indirect dispatches / draw calls, queries to track completion of things, and other queries to measure time spent computing / rendering things, but developers need to build their custom pipelines on top of these primitives.
@capability-snob 29 days ago
I'm all for divergence-busting techniques, even if quite a lot of HPC workloads don't absolutely need them. I suspect the bigger challenge with general vector compute is around memory access, Ian mentioned this briefly but it looks worth digging into.
@dddslimebbb 29 days ago
I'm seeing a lot of mentions of branch prediction, but this seems to be a misunderstanding (or am I the one misunderstanding?). My reading is that the code "flow" is analyzed, and this is used to allocate more compute "width" in that area. Whereas branch prediction is about guessing what comes next on a branch so you don't starve your pipeline. Branch prediction may be an important part of this chip, but it isn't what's being showcased.
@clehaxze 28 days ago
That turns the computation problem into VLIW. Maybe the chip is dynamically assigning how many execution units each branch gets? So less-taken branches get 4 units while the most common code path gets 16. I can see this working somewhat. But instruction retiring and the branch prediction error penalty are going to be nuts. And writing a compiler for this sounds like a yucky problem (assuming they use some annotation in the ISA).
@ullibowyer 29 days ago
When I have a set of ALUs and some branchy code, the proportion of ALUs running the rare path automatically goes up as the rare path gets more common. This gets more complicated with vector units, which suffer a large slowdown when a small number of lanes follow a rare path, but if that's what is being addressed here then the message is being lost/oversimplified. 😢 On the other hand, data flow programming is awesome, so nice to hear something which sounds like that 🎉
@TechTechPotato 29 days ago
It's something we'll go into in time as we dive deeper, for sure
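The vector-unit slowdown described in this thread can be made concrete with a toy cost model. The sketch below is an assumption-laden illustration, not a description of any real GPU or of NextSilicon's hardware: `simd_cost` and `scalar_cost` are hypothetical names, and the model simply charges a lock-step group for every path that at least one of its lanes takes.

```python
def simd_cost(takes_rare, width, cost_common, cost_rare):
    """Cost units for a lock-step vector unit of the given width.

    Each group of `width` lanes pays for the common path if any lane needs it
    and for the rare path if any lane needs it (toy model).
    """
    total = 0
    for i in range(0, len(takes_rare), width):
        group = takes_rare[i:i + width]
        if not all(group):   # at least one lane takes the common path
            total += cost_common
        if any(group):       # at least one lane takes the rare path
            total += cost_rare
    return total

def scalar_cost(takes_rare, cost_common, cost_rare):
    """Cost units when every element runs alone and pays only for its own path."""
    return sum(cost_rare if rare else cost_common for rare in takes_rare)

if __name__ == "__main__":
    lanes = [False] * 31 + [True]   # one lane in 32 takes the rare path
    print(simd_cost([False] * 32, 32, 1, 20), simd_cost(lanes, 32, 1, 20))
```

In this model a single rare lane in a 32-wide group raises the group's cost from 1 unit to 21, while independent scalar units only pay the rare cost for the one element that needs it.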
@mytech6779 29 days ago
Intel Xe has good 64b performance; they purposefully favored HPC rather than AI training, and it is used in Argonne labs' Aurora computer. Xe2 ("Battlemage" when in the Arc graphics form factor) should be even better with 64b int support. Intel oneAPI already offers write-once, compile-everywhere SYCL/C++ across vendors and device types (AdaptiveCpp is another SYCL compiler alternative to oneAPI). nVidia double precision fell off a cliff years ago (like 10% of fp32 these days, basically just software support rather than native 64b register size); my old 2012 AMD W7000 GCN 1.0 had DP speed exactly half of SP speed due to 64b registers that were split for 32b.
@BozesanVlad 29 days ago
I'm curious why ARC is FPGA, at least at driver level to "make" it work as a GPU
@rightwingsafetysquad9872 29 days ago
Nvidia has 64 bit performance that is 1/4 that of 32 bit on their fattest chips. The ones that make it into GeForce cards do not. GA100 had full 64b support, GA102 did not.
@foobarf8766 29 days ago
Goes to show the cost of branched compute tasks on GPUs. For baby-step giant-step I find better performance on Ryzen than can be extracted from any Radeon.
@xpk0228 29 days ago
Blackwell cut a lot on FP64 since their focus is now on AI.
@mytech6779 29 days ago
@rightwingsafetysquad9872 Ah, yes, x100 chips are actually the expected 2:1 of physical 64-bit, but all the others (even in the enterprise line) have had something near a 40:1 drop-off for double precision, at least since Pascal. (I'm only referring to compute speed inside the GPU, without memory bottleneck considerations.)
@MoonDweller1337 29 days ago
How is it different from traditional branch prediction and speculative execution?
@TechTechPotato 29 days ago
Both end up going towards a fixed compute array. Here the size of the compute array changes given the workflow.
@TheWunder 29 days ago
@TechTechPotato Thank you Dr Potato
@lucasfernandesgrotto6279 29 days ago
@TechTechPotato and they do that without being an FPGA?
@quantumbacon 29 days ago
Guess they'll have fixed silicon with various paths/pipelines/SE&BP widths. Then the compiler wraps in some path optimiser tables that cause registers to get used in a programmatic way, which are attached to the various pipelines. So at some point the path lookups get evaluated and compete for optimal use. Feels like it doesn't work for modelling where 'every solution' gets calculated. Or anything like Monte Carlo. Also, devs at the Department of Energy know how to optimise, or they are using machine learning to assist in optimising code. Makes sense to let Nvidia GPUs run the code optimisers.
@lucasfernandesgrotto6279 29 days ago
@quantumbacon thank you!
@keyboard_g 29 days ago
The latest versions of the .NET runtime profile the code and JIT it a second and third time to tune hot paths and generate better assembly instructions.
@dddslimebbb 29 days ago
Would be interesting to see how this could integrate with MLIR (the LLVM magic that Mojo uses). I also wonder if, with sufficient support, this could be well suited for accelerating functional programming languages without the traditional FP when translating to something that will run on traditional hardware. That might make some HPC-using mathematicians *very* happy.
@djsnowpdx 29 days ago
Your video about all big core smartphones cleaned up how I think about Apple CPUs now. I just disregard the little cores. So the M4 is fast, but with only 3-4 big cores, you might consider the M4 Pro for any CPU-intensive workflows, and the M4 Max is not much better, so only buy that if you need more GPU than M4Pro, and expect a slight battery life hit. Thanks Dr. Cutress!
@vasudevmenon2496 29 days ago
I still find it hard to solve fp64 conversion with the mantissa or exponent part when it's a negative number. I remember Pascal had much better fp64 performance than Maxwell, and my friend's 1050 Ti was way faster than a GTX 980 in a CUDA peak workload. Great to see this approach.
@Chriva 29 days ago
1:10 No amount of bits can please the real hardcore people. Check out arbitrary precision maths 😂 (Hunt for primes and pi decimals in particular)
@incription 29 days ago
yep it's literally unavoidable, although you can use integers in place of large floating points, you just have to adjust the formulas
@foobarf8766 29 days ago
It's true I need a 160 bit integer math processor, I don't even care about this floating point stuff, I'm not trying to make a bad poetry machine
@levygaming3133 28 days ago
@foobarf8766 what kinda math are you doing where you’re not ever going to get decimals? Or even just _use_ decimals? Prime number hunting the slow way? (Especially b/c I’m pretty sure that prime number hunting takes advantage of AVX floats.)
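As a small illustration of the "integers in place of floating point" idea raised in this thread, the Python sketch below computes a square root to an arbitrary number of decimal digits using only integer arithmetic. The helper names `sqrt_fixed_point` and `to_decimal_string` are made up for this example; the only library call is `math.isqrt`, which works on Python's arbitrary-precision integers. The "adjusted formula" is the scaling: sqrt(n) * 10^d equals sqrt(n * 10^(2d)).

```python
from math import isqrt

def sqrt_fixed_point(n, digits):
    """Return floor(sqrt(n) * 10**digits) using integer arithmetic only."""
    return isqrt(n * 10 ** (2 * digits))

def to_decimal_string(scaled, digits):
    """Format a value scaled by 10**digits as a decimal string (digits >= 1)."""
    s = str(scaled).rjust(digits + 1, "0")
    return s[:-digits] + "." + s[-digits:]

if __name__ == "__main__":
    # 40 decimal places of sqrt(2), well beyond what an FP64 value can hold.
    print(to_decimal_string(sqrt_fixed_point(2, 40), 40))
```

The cost grows with the number of digits, so this trades speed for precision instead of replacing hardware floating point.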
@tristan7216 29 days ago
Sounds like branch prediction at a larger scale, reorganizing the placement of code on chip to optimize data flow. They should be able to measure the performance per Watt boost on open source science codes, so I'd expect it works pretty well if they've done that. It'll depend on the code though. Interesting.
@countdown4100 29 days ago
8:05 "It's not that. They've told me it's not that." Yeah? Well, then what is it?
@Cybot_2419 29 days ago
Does this only support OMP target, or is there something similar to CUDA/HIP to program this? I'm wondering if it's worth it to port GPU codes to this (that use CUDA/HIP and not OMP target) that are mainly memory bandwidth constrained. Or is this more intended for codes that are CPU only?
@jamesdk5417 29 days ago
As an older gamer, you always surprise me how little I know about things outside of gaming. Thanks very much for
@TheGreenPianist 29 days ago
nice to see that FP64 is not ignored all the way in these times 😅 our NWP models are increasingly mixed FP32/64 precision, but a large part of the code will always need just many FP64 flops
@Swordhero111 29 days ago
Is this just CGRA with extra steps?
@rb8049 29 days ago
I remember the Fairchild multi chip module CPUs in the 1980’s.
@darveshgorhe 29 days ago
What's the difference between the runtime optimization performed by Maverick 2 and something like a JIT compiler or branch prediction? Is the idea that the more used code paths actually use more hardware, whereas JIT compilers and branch prediction create heuristics for code paths in software?
@henrycobb 29 days ago
Intel promised that Itanium just needed a better compiler. How'd that work out?
@foobarf8766 29 days ago
They positioned that against the IBM Power which was like 20 years of compiler work ahead, so not bad considering? But OpenCL is a thing now so maybe this had a better chance?
@mytech6779 29 days ago
Itanium relied 100% on the compiler. That was the whole point: to do all the pipelining stuff in software at compile time rather than in hardware at every execution, and thus get a net savings on silicon area and power consumption.
@HansCNelson 21 days ago
IF runtime adaptive acceleration really works (big if), is it fair to think that it could jump big parts of the CUDA moat?
@rwantare1 29 days ago
And then a small code change destroys your performance because their (proprietary?) runtime optimiser no longer understands what you're trying to do. This already happens with speculative execution, just that programming for magic performance gains is no fun.
@nextlifeonearth 29 days ago
To my understanding speculative execution is exactly what they're doing, but bigger than ever before. A giant branch predictor, a ton of fpus per core fed by this branch predictor. The recompile is probably for their own isa that they designed for ooo.
@elad_raz 28 days ago
@nextlifeonearth This is why we don't follow instructions. No processor core, no execution pipeline, no ISA. Stay tuned for a technology launch in a few months.
@NickChapmanThe 29 days ago
Appreciate the perspective. The late disclosure seemed a little disingenuous.
@karehaqt 29 days ago
Ian, please talk about what's happening with Super Micro, shares down 30% due to their auditors Ernst & Young resigning today. Tech press seems oddly quiet about the whole thing, which has been ongoing for months.
@TechTechPotato 29 days ago
Company investor relations tend to only talk to the investor press. It's rare that Tech Press get a call about share prices
@karehaqt 29 days ago
@TechTechPotato It just seems weird to me that nobody has even spoken of it, especially since the DoJ started investigating them for alleged accounting violations. I'm just wondering if this is going to tank their AI dreams.
@muhdiversity7409 29 days ago
@karehaqt I watched something a few weeks ago that explained exactly how naughty they were being. Something to do with multiple companies colluding to inflate the books. I think that probably explains the media blitz they have been doing across YT talking about their DC solutions. Probably in an attempt to drown out the bad news.
@todorkolev7565 29 days ago
I just watched a PR piece (L1tech) about Super Micro and I was still shocked people see them as a legit company, because I remember when we had to replace all our servers because they were bugged with Chinese spy chips... SuperMicro is greasing the right wheels, apparently!
@adul00 28 days ago
This looks awfully similar to profile-guided optimization (PGO), which collects runtime information to help an ordinary compiler (like GCC) optimize code better for that code execution pattern / scenario.
@erictayet 29 days ago
I think I know what it is. Quick background: I specialised in DSP when I was in school, so I've worked with fixed-precision DSP, MATLAB & gcc in the past. Based on what I'm hearing, I will generalise this chip as a general-purpose multi-precision DSP that supports pipelining & branch prediction. Imagine a DSP that only runs Intel SSE & AVX with SMT support but with on-die HBM, with a compiler that has a front-end like MATLAB but can directly target this new chip. Interested to learn how wrong I am when the chip comes out.
@elad_raz 27 days ago
@erictayet It is dataflow hardware, and stay tuned to learn more!
@erictayet 24 days ago
@elad_raz so like a state machine as implemented in an FPGA to simulate a K-map? But each state machine has an ALU/FPU to run in a neural net rather than a simple comparator? Just shooting wildly here. I have worked with Altera FPGAs in my work and it's a completely different way of thinking about how the machine works. Certainly not a Von Neumann machine, which I'm used to coding for.
@reinerfranke5436 29 days ago
Seems to me like a clever SW solution looking for a hardware demonstration, to later also target "legacy" CPU and GPU mixtures. As I learned from SPICE circuit simulation on GPUs, part of the code is easy to port to small graph flows of some hundred lines, but anything with sparse matrices gets hit by long memory latency. FEM is possibly a different target, where very small kernels are at 100% compute and the HBM only feeds huge chunks of partitioned data. Still, all of these have memory/compute separation. I think the real thing is coming with stacked memory, where the interconnects are counted in millions, each transferring billions per second, not hundreds transferring tens of billions. This will break the memory limit and open up new applications of code.
@TheBackyardChemist 29 days ago
Do they have a good OpenCL driver? I am not going to write vendor-specific code for the product of a company that might do a nitrogen triiodide impression and go poof at any moment.
@TechTechPotato 29 days ago
That's the beauty, the code here isn't vendor specific.
@TheBackyardChemist 29 days ago
@TechTechPotato I am not convinced yet but I hope they succeed
@MrGarrax 29 days ago
Sounds interesting - an accelerator that adapts over time to your code and improves performance and efficiency, but if there is a bug inside this system, that will be very unfunny to debug. Well, we'll have to wait and see. Thx 4 the news.
@ManuFortis 29 days ago
It may not be their intended usage, but I have an idea for a game I've wanted to make for a long time, that I think this technology from Next-Silicon will make incredibly easy for me to accomplish now, in comparison to before. Before, I was looking at the potential of having to deploy servers just for hosting background logic going on in the game, not even multiplayer aspects. This just flipped the table for me. If it can be integrated well enough with the kind of system I have in mind right now... It could be done all in one server. Before, I was looking at a potential cluster, and gasping at the prices. So instead I decided to downgrade some of the graphics that I would end up using, because it would at least free up some compute in the CPU and GPU. But with this... that's not necessary anymore. If I understand correctly, that is. If I do understand correctly, I can offload all the game logic going on onto the accelerator, allowing the GPU and CPU involved to do their own tasks separately. Or in the worst case scenario, it merely makes the operation of all that game logic much more efficient while still being a load on the CPU and GPU as well to some extent. But if that's the worst case, I can work with that. I think. What's the game idea? As much as I would love to share, it also would be a dang shame if said idea were poached. I will instead say this at the very least. Imagine an MMO where everything you do actually affects everyone else, and not just through some premade restrictive scripts, but actual logic dictating what the most likely scenario is next. When you pull a pail of water from a river, it actually reduces the amount flowing behind you. If you chop down a tree, it actually stays chopped down until a new one can grow to replace it, properly. Not spawning in on a set timer. If you pull too much water from that river, the tree may not grow at all due to lack of ground water in the area now. (If taken far enough.)
The way I was looking at the likely path for coding something like that, I was met with the need for parallelism. And a lot of compute capability. You aren't running something like that on a typical CPU, to put it bluntly. And GPUs in the consumer market, well... not happening there either. So I started to look at accelerators. And that's how I got to the server clusters. I put that game idea on hold, because I just cannot even begin to afford to do something on that scale. But with this Maverick chip, I feel like Pandora seeing hope at the bottom of the box.
@sambojinbojin-sam6550 29 days ago
"It's not that. We've said it's not that." "Ok, it's kinda that, but with a patent. Big difference."
@Quarky_ 29 days ago
17:20 is this a 20 min ad?
@quibster 26 days ago
so this is like Adaptive ASIC, but they are also saying for 100% sure they will do the software and not lump it on the customer? could this be the way to go if you "just want more hpc"?
@platin2148 28 days ago
As long as any input isn't serially dependent on any other
@DanFrederiksen 29 days ago
what chemistry did you need FP64 for? 32bit covers quite a range
@JoeHacobian 27 days ago
So they basically made a (jit-next meets the v8 engine) processor for general compute
@xpk0228 29 days ago
Well, this seems like they have to produce working compilers for their hardware, and that is really hard. I guess we should wait and see, but Intel tried with IA-64 and even they could not get the compilers working.
@gadlicht4627 29 days ago
A lot of the best models use neural networks as part of the model, but not as the full model, so there will be continued use for improvements in the non-ML part. For example, if you know the exact physics of a simulation or the laws it obeys, using ML might be frankly stupid if your computer can handle computation of those exact terms well. It may even take less computation power, as you get rid of superfluous things and instead everything goes to the actual calculation. If you do not know the exact physics or laws, or it's computationally impossible, you can still get a boost by modelling what you can model and using a neural network to modify that model at some point, in a hybrid approach. The part of the model not based on a neural network can lead to the neural network being more grounded in reality (so better results), needing less training because it is grounded, being faster at times, and more. This is very much a case-by-case thing.
@thiagofreire4496 29 days ago
Hi, Ian. Does RISC-V already have instructions equivalent to Neon, SVE and SVE2 of ARM CPUs?
@TechTechPotato 29 days ago
RVV goes down that path :)
@juancarlospizarromendez3954 29 days ago
Isn't there GDDR7 memory?
@thegeforce6625 29 days ago
I’m probably wrong, but this kinda reminds me of those Transmeta Crusoe chips from the early 2000’s.
@proesterchen 29 days ago
Sounds IA-64-like in its reliance on compiler and predication at least for the initial setup, while the hardware reconfiguration must have really terrible latency if they go with split resources on branches rather than just redoing the ops using the full hardware on a miss.
@xwingfighter999 25 days ago
So my favourite density functional theory package running at the speed of a GPU? Without having to ask the devs to rewrite all their codebase to CUDA? I am interested.
@TheLkdude 27 days ago
SRC - systems developed similar technology under reconfigurable computing technology
@evdrivertk 26 days ago
I'm thinking that the 800 pound gorillas (Intel/AMD) are going to come out with special compilers that convert your C++/Fortran code to their architecture without all the hand-porting efforts.
@bayanzabihiyan7465 29 days ago
Doesn’t MI300X (and MI300A) have superb FP64 performance while having the memory BW to support it? You mentioned Nvidia, but AMD is, I believe, a bigger player in HPC; they power some of the world's best HPC supercomputers.
@TechTechPotato 29 days ago
Based on total compute, yes, but AMD is only in a small handful of (top) systems.
@ProjectPhysX 29 days ago
Yes MI300X is 82 TFlops vector FP64, and 163 TFlops matrix FP64. That thing is a beast and it will be hard for a startup to become even remotely competitive.
@xpk0228 29 days ago
AMD will probably do better than NVDA in HPC since they did not gut their FP64 path like Blackwell did. Also there is less of a software issue there.
@artifactingreality 29 days ago
I have been imagining such a chip myself. A self-programming fpga if you will. Amazing that someone is going to build it.
@alexg50446 21 days ago
Is it only better at FP64 in performance/power, but not lower precision?
@Veptis 29 days ago
modern NPUs only do INT8 (plus a bit more FP on the DSPs)... so I am now wondering if you can write some kernels to do fp32 math with the int8 MACs
@ProjectPhysX 29 days ago
Possible yes, but throughput will be awful, especially with emulation support for denormals. So it doesn't really make sense.
@Veptis 29 days ago
@ProjectPhysX I have seen doom run on worse hardware... But this will be my summer project for the winter
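To give a feel for why the throughput would suffer, here is a rough Python sketch of the core step only: multiplying two 24-bit FP32-style mantissas using nothing but 8-bit by 8-bit products. The names `split_limbs` and `mul_via_8bit` are hypothetical, the limbs are treated as unsigned (real NPU MACs are often signed INT8), and exponent handling, rounding, and denormals are deliberately left out.

```python
def split_limbs(x, n_limbs=3):
    """Split a non-negative integer into n_limbs 8-bit limbs, least significant first."""
    return [(x >> (8 * i)) & 0xFF for i in range(n_limbs)]

def mul_via_8bit(a, b, n_limbs=3):
    """Schoolbook multiply built only from 8x8-bit products and shifted adds."""
    limbs_a = split_limbs(a, n_limbs)
    limbs_b = split_limbs(b, n_limbs)
    acc = 0
    for i, x in enumerate(limbs_a):
        for j, y in enumerate(limbs_b):
            # Each x * y fits in 16 bits; the shift places it at its weight.
            acc += (x * y) << (8 * (i + j))
    return acc

if __name__ == "__main__":
    a, b = 0xABCDEF, 0x123456   # two 24-bit values standing in for mantissas
    print(hex(mul_via_8bit(a, b)), hex(a * b))
```

One 24-bit mantissa product already needs nine 8-bit multiplies plus the accumulation, before any of the floating-point bookkeeping, which fits the expectation above that it is possible but slow.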
@skypickle29 29 days ago
How is this different from branch prediction? I even remember the DEC Alpha, which had a processor monitoring the CPU for metrics like this. Unless the processor can reconfigure an FPGA that is optimal for the observed calculations, then rewrite the code to maximize efficiency on the fly - then the design will not be optimal.
@ABaumstumpf 28 days ago
I mean that is what branch-predictors are already doing. And everything you and them have presented so far sounds exactly like a CPU with an FPGA and some fixed-function blocks - which falls flat in terms of performance compared to the more normal vectorisation-approach for most cases, but can be faster if the workload is not your normal memory intensive task but rather you need some more complex operations and have extra blocks for that (some extra trig-hardware etc). And really? Code is mostly taking the most-likely path? XD
@Matlockization 29 days ago
It was very interesting that you would display who and how many accelerators were used. I don't see why Intel can't populate their P & E cores in a grid with GPU cores right now. However, I think AMD is closer to this practically than Intel. Obviously, I have concerns about latency.
@dankodnevic3222 29 days ago
After years of reading about miracle devices which turned out to be flops, I'd rather believe it when I see it. Related to the precision issue, I would like to see a scalable FPU that goes beyond FP64 in hardware when needed (high order polynomials, etc.), more than some magical branch prediction.
@jimtekkit 29 days ago
I'm hoping like hell that Radeon will bring back some FP64 compute performance to the masses with UDNA. Nvidia severely nerfed it with Maxwell and even many Quadros are nerfed. The upsell is insanely steep. Radeon aren't much better right now with their focus on CDNA for that type of workload.
@MrMrMrMrT 26 days ago
Isn’t it at a cost disadvantage? From a power draw aspect
@1introvert_guy 29 days ago
12:15 this is such a marketing graph (well, because it is!). But I hate these graphs :/ especially because I can't see the numbers or more details.
@RwilliaMHI 28 days ago
It's not an fpga+asic programming another fpga within the SoC, like it wasn't comingling of funds at ftx crypto.
@sameeranjoshi1087 29 days ago
Good one
@acasccseea4434 29 days ago
Doing disclosures at the end is dodgy... If you don't want to spend watch time, at least put a text up...
@kamilhorvat8290 29 days ago
Is this Transmeta CPUs reinvented?
@PterAntlo 29 days ago
I wish them the best, but that sounds very much like what Intel said with Larrabee: you don't have to adapt your program, just recompile it and our compiler/lib/JIT will do the rest. And well, that didn't work out as well as everyone hoped.
@jedijackattack3594 29 days ago
So it's a feed-forward DPU. We have had these for ages and I don't think it's going to help for most HPC tasks.
@foobarf8766 29 days ago
If you mean the IBM/DARPA thing that was never going to go retail, but now that OpenCL is a thing, this might have a chance?
@MaxHaydenChiz 29 days ago
I really want to understand how this hardware works. Is it a variation of a CGRA? Regardless, extraordinary claims require extraordinary evidence.
@foobarf8766 29 days ago
Also curious, but is it really that extraordinary? IBM made similar leaps between Power generations (4096 entries in the Power10 TLB). The Intel/AMD entry to the HPC space with GPUs is because of their price point, not capabilities.
@moienahmadi2377 7 days ago
Founder of NextSilicon is Elad Raz. According to Founders Village: "Mr. Raz served in the elite 8200 intelligence unit of the Israel Defense Forces". The more you know... ⭐
@TechTechPotato 7 days ago
Israel does have mandatory military service. A lot of tech people there have been in intelligence forces one way or another - it's why Israel is a tech hub.
@Mark_Williams. 28 days ago
Remember these numbers. Look at this cool new tech! Numbers under embargo... bah! lol Looks very cool though. Gives me vibes of Intel's alleged Royal Core project with rentable units. An architecture that dynamically adapts to the workload to improve performance. Interesting stuff!
@incription 29 days ago
It doesn't accelerate AI in any way, does it? Just to make sure.
@TechTechPotato 29 days ago
Only at full precision, not reduced precision modes
@quantumbacon 29 days ago
Ian, I think you might be giving people the impression that FP64 makes calculations at 64-bit precision. This is incorrect.
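One way to read this comment, assuming "FP64" means IEEE 754 binary64 (the usual meaning): the format is 64 bits wide, but only 53 of those bits carry significand precision, with the rest going to the sign and exponent. A small sketch that checks this from Python, where `float` is binary64 on mainstream platforms; the helper name `fp64_fields` is made up for this example.

```python
import struct
import sys


def fp64_fields(x):
    """Split a Python float (IEEE 754 binary64) into sign, exponent, fraction."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)  # 52 stored bits (+1 implicit leading bit)
    return sign, exponent, fraction


# Significand precision in bits, as reported by the interpreter.
print(sys.float_info.mant_dig)

# Above 2**53, consecutive integers are no longer all representable.
print(2.0**53 + 1.0 == 2.0**53)

print(fp64_fields(1.0))
```

So "64-bit" describes the storage width, while the usable precision is about 15 to 17 significant decimal digits.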
@TheoneandonlyRAH 29 days ago
this is nice!
@JohnJohn-ts6ux 29 days ago
Hi sir, love your videos very much. I admire your hard work, thank you so much again. Could you please do a video on the MediaTek 9400 CPU vs the Snapdragon Elite? I'm thinking of getting the Samsung Ultra 25, or possibly an Oppo flagship high-end smartphone with the MediaTek 9400 CPU, so which one performs better? Thanks for your time, keep it up 😀😀
@TechTechPotato 29 days ago
I'm waiting for a D9400 and S8E sample
@JohnJohn-ts6ux 29 days ago
😊ok thanks
@philflip1963 29 days ago
The Road Not Taken
By Robert Frost
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,
And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.
I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I-
I took the one less traveled by,
And that has made all the difference.
@oj0024 29 days ago
Does the number 0.8373 mean anything to you?
@jonathanjones7751 29 days ago
Ponte Vecchio did 52 TFLOPS of FP64, but Intel sunset it. Was it more hardware or software that limited its adoption?
@TechTechPotato 29 days ago
A bit of both, but also the theoretical memory bandwidth was almost impossible to achieve. The Chips and Cheese team even worked with Intel for their coverage and struggled to get >50%.
@jonathanjones7751 29 days ago
The memory bandwidth is a great point. 47 tiles or something, and we're seeing memory issues with Foveros with ARL. Thank you for the reply. Hopefully it can get remedied for Falcon Shores, if that is still an HPC part.
@xpk0228 29 days ago
It's more like the design of PVC is just not good. From what we see in Aurora, the 52 TFLOPS is peak and unsustainable under real-life conditions. MI250X, on the other hand, can do 45 TFLOPS consistently in Frontier.
@acasccseea4434 29 days ago
Sounds like branch prediction😅
@MasamuneX 29 days ago
What if we made an ASIC that just "changes"?
@alexcastas8405 29 days ago
'applications run orders of magnitude faster' ... big claims
@MrAndrzejWu 28 days ago
ok it sounds interesting :)
@kilngod1943 28 days ago
AMD accelerators get 3x better FP64 compute than Nvidia; there is a reason national labs are buying AMD-based supercomputers.
@firsttyrell6484 29 days ago
This chip looks like a nightmare to optimize for. Look, on the first run this part of code was slow, I'm going to optimize it. On the next run this part of code does not matter anymore due to hardware magic (optimization), but the code is still slow in some other place instead, back to square one.
@pcoverthink 28 days ago
L1 size is a huge red flag for me. Money can buy good nodes and a lot of HBM, but this L1 amount sounds like BS.
@AhmadAli-kv2ho 29 days ago
There's 256-bit floating point?
@lbgstzockt8493 29 days ago
Theoretically you can have any power of two for your size, it just gets really impractical really fast. Pretty much nobody does more than 256 bits.
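To put numbers on that: IEEE 754 fixes the layouts for binary16, binary32 and binary64, and for wider interchange formats (128 bits and up, in multiples of 32) it derives the exponent width from the total size. The sketch below assumes the rule from IEEE 754-2008, w = round(4·log2(k)) − 13, and the function name `binary_format` is made up for this example.

```python
import math

# Fixed layouts from IEEE 754: total bits -> (exponent bits, precision bits).
# Precision counts the implicit leading bit of the significand.
FIXED = {16: (5, 11), 32: (8, 24), 64: (11, 53)}


def binary_format(k):
    """Return (exponent bits, precision bits) for an IEEE 754 binary-k format."""
    if k in FIXED:
        return FIXED[k]
    if k < 128 or k % 32 != 0:
        raise ValueError("wider interchange formats need k >= 128, multiple of 32")
    w = round(4 * math.log2(k)) - 13  # exponent field width
    p = k - w                         # sign + trailing significand + implicit bit
    return w, p


for size in (16, 32, 64, 128, 256):
    print(size, binary_format(size))
```

For binary256 this gives a 19-bit exponent and 237 bits of precision, which hints at why hardware rarely goes that wide.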
@LogioTek 29 days ago
Radeon VII still good then?
@TechTechPotato 29 days ago
Efficiency ain't great, and the software stack needs work, but zoom zoom
@LogioTek 29 days ago
@TechTechPotato Yeah, tell me about the AMD software/driver stack. I sometimes get AMD driver crashes just from playing YouTube videos on my 7950X3D iGPU. When I actually edit videos it becomes a nightmare. From my tinkering several years ago, Radeon VII efficiency doubles when you reduce core and memory clocks by 25% each.
@RicoElectrico 29 days ago
I wonder if Intel will acquire them only to sell off 5 years later.
@Server0750 29 days ago
@ultraveridical 25 days ago
Another video, another mention of "clients". These are becoming ads more and more, and with the disclosure near the end.
@TechTechPotato 25 days ago
This video isn't an ad. But good try though. I'm an analyst and consultant. All my clients, past and present, are listed in the description. I'm very open about this.
@vogue43 29 days ago
All that about flow was pretty much the ... before profit. It explained nothing. Magic happens, perf goes to the moon, trust me bro.
@cj09beira 29 days ago
Kind of a shame CDNA wasn't mentioned at all when its much more HPC focused than the Nvidia counterparts
@TechTechPotato 29 days ago
More content to come! :)
@rb8049 29 days ago
Does MATLAB run on it?
@cj09beira 29 days ago
@TechTechPotato Btw, any plans to talk about SOI? its been absent of late since GF gave up on 7nm. With all this new quest for high performance, I wonder why what seems like an "easy" avenue for a frequency and/or efficiency boost isn't being used.
@JorgetePanete 17 days ago
it's*
@Squilliam-Fancyson 2 days ago
Die to die for:)
@foobarf8766 29 days ago
Intel and AMD should be here with products like this... where are they? Smoking blockchains behind the bike sheds again?
@DS-pk4eh 29 days ago
I thought AMD had good hardware with 64-bit FP support
@shieldtablet942 29 days ago
This smells like vaporware to get investor money. If you have this (which seems more like SW than HW), you are making bank or getting bought for billions and landing at AMD or Intel. Even compiler auto-parallelization (which this seems to be) has not been cracked for 15+ years. The best we have is Nvidia's threading model and stuff like OpenMP, which, when I worked in the field, was always losing to MPI.
@TechTechPotato 29 days ago
That's why it's not that :)