Cash is super important to performance as you can't buy anything without it. Cache is also important. You need cash to buy more cache :P
@AndreVanKammen 3 years ago
My fortune is mostly write-through cash, so not enough money to buy cache.
@TechTechPotato 3 years ago
It comes in from the day job, falls through your pocket, right into your bills.
@MekazaBitrusty 3 years ago
I believe that cache is pronounced caish, not cash.
@alfredzanini 3 years ago
@@MekazaBitrusty I've heard it pronounced as Keishe as well.
@MekazaBitrusty 3 years ago
@@alfredzanini Well, Google just slapped me a new one. Apparently it is pronounced cash! English is so screwed up 🙄
@jimiscott 3 years ago
As you mentioned at the outset, the caching needs to be optimised for the workload. IBM's Z-Series is highly targeted to its workload. The shared caching works when you've got a well-known (homogeneous) set of services. A (large) x86 server could be hosting VMs all doing different things, and a workstation/PC a game, browser tabs, a video editor, a dev environment/tools and a whole raft of utilities and services... each of these actively fights for cache, invalidating older cache lines. The more discrete services running, the easier it is for cache lines to be invalidated... with enough discrete active services, at worst you almost have a FIFO stream of once-used data. Dedicated core caches at least provide some barrier against other cores effectively starving the owner core of its own cache (though I am sure IBM has thought of this), therefore allowing threads (which are more inclined to always run on the same core) to run without interruption from other cores/threads. Since a Z-Series machine will have a number of known discrete services running on it at any one time, I would guess the OS will either auto-tune the caching strategy for each of these services or allow it to be defined, like it does the allocation of cores. x86 doesn't really have this luxury due to its broad, multi-purpose nature.
@TheStin777 3 years ago
Thanks for the explanation 👍. Perhaps in a few years this type of caching will come to x86. (I think) if it goes to ARM first, x86 will be on death's door if it doesn't come up with its own solution fast! (As far as I understand the tech.)
@orkhepaj 3 years ago
so how is this good for gaming?
@TheStin777 3 years ago
@@orkhepaj kzbin.info/www/bejne/iIKZnGCZhp5niMU
@alexanderzero5752 3 years ago
@@orkhepaj Video games are not usually optimized for multicore performance. So if someone with a PC was playing a game but not really using other programs at the same time, the L2 caches of the other cores likely wouldn't be used very much, meaning there wouldn't be a lot of cache misses on the virtual L3 cache; in addition, the core running the game now has a much larger L2 cache to itself. Things might get more complicated for streamers, though, who do have some extra compute going on to process and encode their video/audio inputs.
@egg5474 3 years ago
Isolated cache is also more secure, as we saw with the release of Zen 3.
@danieljones6778 3 years ago
Can't wait for the new round of Spectre exploits based around this.
@CulturedThugPoster 3 years ago
Exactly what I was thinking.
@burdebc1 3 years ago
I don't want to see it happen, but it will.
3 years ago
exploits on Z? this isn't Intel
@CulturedThugPoster 3 years ago
@ Yet the implications the good Potato points out are that Intel and AMD may well exploit this IP in their own architectures, which implies the same problems with predictive branching that are at the heart of the Spectre exploits.
@sugo8479 3 years ago
If they do a bad enough job we might even see something that can be replicated outside of a lab.
@bernardfinucane2061 3 years ago
It seems like this would only make sense if the cores are out of sync. By that I mean that if they all have similar loads, their L2s would tend to fill in parallel, reducing the virtual L3 available.
@kellymoses8566 3 years ago
The kind of software that IBM mainframes run really doesn't scale to many cores, so I think that is a safe assumption. These systems are usually running many heterogeneous workloads.
@MaxIronsThird 3 years ago
Yeah, for games it might be better to stick with a big L3 cache, because there would be a lot of redundant shit in L2, given that games are coded to be really good at multithreading.
@Freshbott2 3 years ago
@@MaxIronsThird Maybe not though, because (not sure if he touched on this, but I'm not going to rewatch) it would depend on where you're processing it. The time it takes to kick something out of the core's cache wouldn't really matter while you're not processing it, but if you can just process it in its new adjacent core when you need it, that would be faster than sending it back to the original core, and faster than jumping up and down layers of fixed cache. AMD has shown a little bit of extra latency seems worth it to gain a lot of shared cache, so this system should have the best of both worlds.
@JJFX- 3 years ago
@@Freshbott2 So in theory what gets kicked out to virtual L3 essentially becomes L2 again if the adjacent core is able to process it. I wonder if it would also be possible for something like Intel's big.LITTLE-style hybrid concept to sandwich their smaller cores between each 'pair' of L2...
@TheGraemeEvans 3 years ago
Yeah, it depends on some cores being idle for their cache to be available for others to borrow.
@solidreactor 3 years ago
I loved the "Cache me outside, how about that" :D Imagine a Zen with each core having a 12.5 MB+ L2 cache, now that's a gaming CPU for us "cacheual gamers" ;)
@vyor8837 3 years ago
Or 32 MB of L3 and 4-8 MB of L2.
@paulgray1318 3 years ago
Cache lines matter ;).
@white_mage 3 years ago
@@vyor8837 The sole idea of having an L1 cache is to make things a lot faster. Then there's the L2 cache as a backup for the tiny L1 cache, and then there's the L3 cache, which is huge but slow. I didn't even think an L3 cache could be important to have, because I thought it was super slow when I realized it existed. I still think a massive L2 is better than having L3, but if a massive L3 is what I have, I won't complain; it's still better than just having L2. But again, I'd rather have a massive L2: I think it's faster than going to check L3 or L4.
@vyor8837 3 years ago
@@white_mage 12.5 MB is smaller than what we have now for cache. Amount beats speed; power draw matters more than both. L2 is really power hungry and really large for its storage amount.
@white_mage 3 years ago
@@vyor8837 You're right, I didn't realize it was that big. If most of the IBM chip shown in the video is cache, then the L2 has to be huge. What about L3? How does it compare to L2?
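A quick way to reason about the big-L2 vs L2+L3 question in this thread is average memory access time (AMAT): each level's latency weighted by the chance an access gets that far. The sketch below is a back-of-envelope model; every latency and hit rate in it is an invented illustration, not a measured number for any real chip.

```python
def amat(levels, dram_latency):
    """levels: list of (hit_latency_cycles, hit_rate) from L1 outward."""
    total, reach = 0.0, 1.0            # reach = fraction of accesses that get this far
    for latency, hit_rate in levels:
        total += reach * latency       # every access reaching a level pays its latency
        reach *= 1.0 - hit_rate        # only the misses continue outward
    return total + reach * dram_latency

big_l2 = [(4, 0.90), (14, 0.98)]              # design A: big L2, no L3
l2_l3  = [(4, 0.90), (12, 0.90), (40, 0.90)]  # design B: smaller L2 plus a slower L3
print(f"big L2 only: {amat(big_l2, 300):.1f} cycles on average")
print(f"L2 + L3:     {amat(l2_l3, 300):.1f} cycles on average")
```

With these made-up numbers the two designs land within half a cycle of each other, which is the point: the winner depends entirely on the hit rates your workload actually achieves at each level.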
@Big_Tex 3 years ago
You’d think you’d want a huge amount of L1 but there’s a cache-22 …
@user-ty2uz4gb7v 3 years ago
🤦♂️
@EShirako 3 years ago
...that was an awful, naughty pun and you should feel bad. Stop that! STOP GIGGLING, DARN IT! o.@; *IsKidding, that was terrible-in-good-punny-ways pun, which means it was awful but a pun, so a good-awful! Does bet such a naughty punster is giggling, though!*
@pirojfmifhghek566 3 years ago
This joke has serious cachet.
@porina_pew 3 years ago
Two random thoughts: The virtual L4 to me implies that socket-to-socket access is still faster than socket-to-RAM. Is that right? I guess that also involves NUMA. How does L4 compare to local RAM? Recently we've seen Intel go relatively bigger on L2/core while not so much on L3, and AMD going bigger on L3/core with relatively small L2. Wonder what the thinking is behind that design choice.
@SirReptitious 3 years ago
From what I remember, the reason AMD had the big L3 cache when they made Zen was so that no matter how many cores were being used, ALL of the L3 would be utilized to maximize performance.
@shadow7037932 3 years ago
Given their Z platform, where they need to shuttle data between CPUs all the time, it wouldn't surprise me if CPU-to-CPU is faster than CPU -> DRAM.
@winebartender6653 3 years ago
A lot would depend on the TLB structure and architecture, as well as what your latency penalties are for going through the cache levels. He went into a lot of the decisions that can be made about caches and their sizes. It's clear that AMD made an early decision in Ryzen's life cycle that L3 would be their low-latency bulk storage compared to DRAM. Considering the design inherently increases DRAM access time (relying on a GMI link between cores and IO dies), they needed a way to make up that performance loss. Also, with their announcement of V-Cache/stacked silicon, they have doubled down on this concept. My guess is that going to the chiplet design helped force their hand in this direction due to size constraints, where it makes more sense to reduce the cache bottleneck at the die level as opposed to the core level. Also, you have to be very confident of core utilization to sacrifice die area for L2 in place of L3, as that space is wasted if the core isn't being used. This is where IBM's design is very interesting, but it would be interesting to see the latency of the virtual L3 in the worst case (furthest cores apart). Maybe Zen 4 or 5 brings stacking from the get-go, allowing the base chiplet to have larger core-specific cache (L1 and L2) and the upper die to be used like V-Cache for L3.
@kyoudaiken 3 years ago
Yes, socket-to-socket links are proprietary and very fast. DDR4 and DDR5 are currently the biggest bottlenecks in any system.
@RobBCactive 3 years ago
@@winebartender6653 Also, fetching data from DRAM costs more power; they were explicit about this when explaining the Infinity Cache for RDNA2. Not only a narrower bus to save die space, but the cache improved power efficiency.
@OrangeEpsilon 3 years ago
We all probably want to see lots of graphs and diagrams that show all the performance characteristics :D The idea of using virtual L3/L4 caches inside massive L2 caches is really brilliant. I would really like to explore this in my future master's thesis, which I will start in about two years.
@RobBCactive 3 years ago
Is it so brilliant for the coder though? This will look like a massive distributed L3 with less predictable access times, rather than a typical L2, because the L2 is now inherently shared.
@kiseitai2 3 years ago
@@RobBCactive Perhaps, but oftentimes you end up working on legacy projects in which the stakeholder wants some refactoring done fast (often out of necessity). At that point, you don't always have the bandwidth to tailor the code to the platform. Also, often, the code may be in an interpreted language (either moving from or to one). That's code you can't hand-tailor for a CPU as easily as C or assembly. My opinion is that a lot more code will passively benefit from this cache redesign than not.
@RobBCactive 3 years ago
@@kiseitai2 But it's inherently giving less control; the kind of code you're talking about relies on L3 anyway, and it's not the kind of high-performance code that tunes to L1 & L2. Besides, you can query the cache size and size the chunks you do heavy processing on accordingly. It's not as hard as you make out to dynamically adjust. I just think people need to recognise that there's potentially more cache contention, and it may not be reasonable to process data in half-L2 chunks. I don't really see only advantages to a very complicated virtual cache system over L2 per core backed by a large L3, which follows data locality principles. There have been enough problems with SMT threads being able to violate access control already.
@OrangeEpsilon 3 years ago
@@RobBCactive I think it is true that the access times will become less predictable because the size and partition of the L3 cache is dynamic. But what if we give the software some control over how to partition the virtual L3 cache? It could make the access times more predictable, but the OS could also move threads faster from one core to another. Also, a core could be put to sleep and then the associated cache can be configured as a non-virtual L3 cache. You see, there are lots of things that make this interesting and could bring some advantages, so that the disadvantage of unpredictable access times can be minuscule.
@abdulshabazz8597 3 years ago
@Fabjan Sukalia : Computational RAM is the future. Compute-in-memory architectures will compete against and overtake multicore designs, even those having gigabyte-sized caches.
@kungfujesus06 3 years ago
I do wonder if this, ever so slightly, increases the attack surface by basically making every private cache line globally accessible. I wonder if there might be even better side-channel attacks enabled by this.
@dfjab 3 years ago
The sad reality is that the faster you make the CPU cores, the more such attacks will be possible. The industry might not be ready to admit it yet, but speed is most likely not compatible with security.
@kazedcat 3 years ago
IBM could design it so that the private cache is encrypted and the memory address is hashed and salted. That way, even if a side-channel attack extracts a private cache line, attackers would still need the keys to unlock the data, and those should only ever be kept in the core.
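As a toy illustration of the hashed-and-salted half of that idea, a cache could derive its set index from a keyed hash of the line address, with a per-thread key that never leaves the core, in the spirit of published randomized-cache proposals rather than anything IBM has disclosed. The set count, line size and the choice of BLAKE2 as the keyed hash below are all assumptions for demonstration.

```python
import hashlib, os

NUM_SETS = 256                 # hypothetical set count
LINE_BITS = 8                  # assume 256 B cache lines

def set_index(addr: int, thread_key: bytes) -> int:
    line = addr >> LINE_BITS   # drop the offset-within-line bits
    digest = hashlib.blake2b(line.to_bytes(8, "little"),
                             key=thread_key, digest_size=2).digest()
    return int.from_bytes(digest, "little") % NUM_SETS

key_a, key_b = os.urandom(16), os.urandom(16)  # per-thread keys, held inside the core
addr = 0x7FFF_DEAD_BEEF
print(set_index(addr, key_a), set_index(addr, key_b))  # same address, (almost surely) different sets
```

Because one thread's placement looks random to every other thread, classic prime-and-probe style set profiling gets much harder, which is the property the comment is after.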
@kungfujesus06 3 years ago
@@kazedcat I feel like the secure enclave concept tends to eventually be defeated. At least that was the case for Intel.
@kazedcat 3 years ago
@@kungfujesus06 The goal is not to make a bulletproof system but to make the exploits extremely hard to implement. If each logical thread has its own unique keys and salts, then exploits need to string together multiple vulnerabilities instead of needing only one. Also, extracting keys from the core will be harder than extracting data from a shared cache. The problem with a secure enclave is that it outsources security to a separate dedicated core instead of each core handling its own security. This reduces the attack surface but also makes security a lot more fragile: you only need to crack that one security core and everything becomes vulnerable.
@ear4funk814 2 years ago
@@kazedcat Or you can design and patent it if they haven't done so ... seems you have a solution to a real problem.
@CharcharoExplorer 3 years ago
Super fast L1 with high associativity. Massive L2 cache that can work as L3 at times, the way IBM designed it. V-cached L2 or L3, and a chiplet with a decent, massive L4. That would be godlike.
@erichrybinczak8894 3 years ago
Agreed. With the new packaging tech from Intel and AMD, the next few CPUs could be some interesting designs, testing out new techniques until they find the right optimizations and features that make CPUs 2x the current performance.
@crustybread9286 3 years ago
Virtual L3/L4, amazing. Wonder how coherence is communicated across cores and across chips, and how virtualization of apps affects the flushing/invalidation of these virtual caches. Amazing engineering!
@julkiewitz 3 years ago
L2 / virtual L3 sounds like something perhaps universally applicable. L4 sounds like something only useful in the context of the Z architecture. Consider that all those reliability guarantees must add quite a bit of overhead on the Z system. Not sure this will still apply to something like a dual-socket x86. Still, the tech behind it must be absolutely impressive.
@nonae9419 3 years ago
1: I would rather describe the cache layout the other way round. The L1 cache is the one you work with, and anything you don't need is stored in RAM. This way the importance of cache is more pronounced, instead of it sounding like just another 2% performance improvement. 2: Branch prediction is used for optimizing small loops (size dependent on µ-op cache size) and decreasing pipeline stalls. What you meant was cache prefetching.
@deilusi 3 years ago
I hope that marking L3/L4 works; if not, this seems like Meltdown madness. Even if that virtual L3/L4 won't be migrated to our CPUs for security reasons, a massive L2 would be interesting to see.
@petergerdes1094 a year ago
This sounds great for mainframe-style applications but sounds like a nightmare for any use case where you need to worry about side-channel attacks from other processes.
@jacobpalecek7439 3 years ago
Thank you for your amazing content, Ian. Even as a computer scientist (albeit focused on software only), I still learn new things from most of your videos!
@Snotkoglen 3 years ago
That Linode ad with the buff American accent cracks me up every time. :)
@FreeloaderUK 3 years ago
Ah, the American accent which slips into an English accent every few words.
@hammerheadcorvette4 3 years ago
😆😆😆
@Grass89Eater 3 years ago
I think it might work very well when not all 8 cores need to use their entire L2 cache at the same time. Games usually have very uneven workloads between different threads, so it will probably work great for games.
@benjaminoechsli1941 3 years ago
16:21 The good doctor is at a loss for words at how huge and cool that is. Love it.
@deechvogt1589 3 years ago
As always, Ian, very fascinating. Keep the awesome content rolling.
@Supermangoaliedude 3 years ago
I worked on the memory coherency unit for these core+cache units. It was fascinating to learn about, and I'm glad I could be a part of implementing it.
@HexForger 3 years ago
12:32 Well, simply put, this complements the mainframe setup perfectly. I mean, even older MVS concepts like shared/common virtual memory, or the coupling facility, all enabled Z to perform like no other infrastructure while still maintaining a high degree of parallelism and ridiculously low IO delay. Ironically, despite the "legacy" label often attached to mainframes by the press, Z still remains one of the most sophisticated and advanced pieces of technology, improved by each iteration to this day.
@abzzeus 3 years ago
I'm old enough to remember when ARM was used in RISC PCs; they went from an ARM610 at 40 MHz to a StrongARM at 203 MHz with 16 KB of cache, and the speed-up was way, way more than what the raw MHz would indicate. The BASIC interpreter now fitted in cache. Programs ran so fast it was insane. Some games had to be rewritten/patched as they were unplayable. All down to the cache.
@estebanguerrero682 3 years ago
Thanks for the explanation; it is amazing to see these new approaches to latency and allocation :O
@Winnetou17 3 years ago
This is nice. This allows both L2 and virtual "L3" and "L4" in bigger amounts than before, but it also gives a single-threaded workload the opportunity to have a very nice, very big 32 MB of L2.5 cache. And still more L3 and L4, if the other cores are barely doing stuff. A huge increase either way.
@TreborSelt 3 years ago
I'm so glad I found your channel. It's the only way I can get info on the deep dives of the tech industry. 😎
@Zonker66 3 years ago
That was very comprehensive. Never understood cache that well. Thank you.
@Netsuko 3 years ago
I just wanted to say that your videos are fantastic! Really well explained and educational to watch. I never understood what L1, 2 or 3 cache ACTUALLY did. Until now.
@superneenjaa718 3 years ago
That's genius, but the potential problem I see here is: how are they pushing out old data? We wouldn't want cases where a core doesn't have enough space in its private cache because it's full: it goes to store elsewhere, but its private cache was filled with old data from other cores that won't be used much again. A good amount of planning needs to go into that (which I suppose IBM's engineers have already done). However, I wonder if the virtual L4 helps at all. Shouldn't it be as slow as memory, since it's outside the chip anyway?
@AdamBrusselback 3 years ago
There are probably quite fast interconnects between the chips in the system, likely still lower latency than RAM, otherwise they wouldn't bother.
@orkhepaj 3 years ago
Probably they can throw out data from other cores when they need the space.
@norili7544 3 years ago
If a cache is full, the oldest data gets evicted. My understanding is that the caches will prioritize in-core usage as L2 over out-of-core usage as L3.
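That priority can be made concrete with a toy replacement policy: plain least-recently-used eviction, except that borrowed "virtual L3" lines are sacrificed before the owning core's own L2 lines. This is a sketch of the concept only; the real replacement policy is more sophisticated and not public.

```python
from collections import OrderedDict

class SharedL2:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()        # addr -> is_local (True = owning core's L2 line)

    def touch(self, addr, is_local):
        self.lines.pop(addr, None)
        self.lines[addr] = is_local       # most-recently-used lines sit at the end
        if len(self.lines) > self.capacity:
            self._evict()

    def _evict(self):
        # Prefer the least-recently-used borrowed (virtual L3) line...
        victim = next((a for a, local in self.lines.items() if not local), None)
        if victim is not None:
            del self.lines[victim]
        else:
            self.lines.popitem(last=False)  # ...else evict the oldest local line.

cache = SharedL2(capacity=4)
for addr in (0x00, 0x40, 0x80):
    cache.touch(addr, is_local=True)      # the owning core fills its own L2
cache.touch(0x1000, is_local=False)       # a neighbour parks a "virtual L3" line
cache.touch(0x2000, is_local=False)       # over capacity: 0x1000 goes, locals survive
print(list(cache.lines.items()))
```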
@countach27 3 years ago
This really is impressive; unless all cores are heavily utilized, you have an almost infinite amount of cache. Crazy, crazy.
@TheMakiran 3 years ago
Considering that consistent 100% utilization is almost impossible, this is very cool.
@fteoOpty64 3 years ago
@@TheMakiran Ian can tell you his Pi machine will certainly run very close to full bore on all cores!
@Innosos 3 years ago
I'm impressed that you seemingly have the money to buy a multi-socket system. Us mere mortals might as well wait another decade before these systems become affordable, at which point, I presume, the requirements for common applications will have appropriately expanded to fill that "infinite amount of cache" rather quickly.
@humpadumpaa 3 years ago
@@Innosos I'd say the equivalent of sharing the L4 cache between processors would be to share it between chiplets/tiles instead.
@hammerheadcorvette4 3 years ago
At 5.2 GHz, they are hoping to "clear" the L2 as fast as possible. That is reasonable, as most of these machines are going into the finance sector.
@jk-mm5to 3 years ago
The L3 cache on my old 7900X has the same transfer rate as my DDR4 main memory. It seems that traditional L3 cache is obsolete with today's faster RAM.
@emperorSbraz 3 years ago
L2 used to reside on the motherboard. I had one with a cache SLOT that fitted an extra 256 KB on top of the onboard 256 KB. :)
@SirReptitious 3 years ago
Hmm, I don't remember ever seeing a motherboard with that feature! All the ones I recall using had two 256k SRAM chips soldered to the motherboard. And for some reason it was common enough for the L2 cache to go bad that the BIOS had the ability to bypass the L2 cache when that happened, albeit with greatly reduced performance. And of course you will remember that back then the memory controller wasn't in the CPU but in the Northbridge. So we had a situation that today sounds insane: when the CPU needed data, it would check its L1 cache; if not there, it would check the (external) L2 cache; and if not there, it had to ask the memory controller on the Northbridge to go to RAM, fetch the data and send it to the CPU. It sounds so barbarically crude now it's almost funny. But back then, in the stone age, process nodes were so huge there just wasn't room to put everything on the CPU. And if you wanted integrated graphics so you didn't have to buy a graphics card, did you buy a CPU with a GPU built in? Of course not, that was also the job of the Northbridge! LOL! There are a LOT of things I can say I miss about "the good old days", but that stuff is NOT part of it! ;-)
@robwideman2534 3 years ago
That was called a COAST module. Cache On A Stick!
@Variarte_ 3 years ago
This kind of feels like an alternative and an extension to AMD's paper 'Analyzing and Leveraging Shared L1 Caches in GPUs', except for CPUs.
@rattlehead999 3 years ago
This also reminds me of the Core 2 Duo, which had a big L2 cache for its time.
@RobertVolmer 3 years ago
Fun fact: the K6-III and K6-III+ at 1:41 did have 256 KB of L2 cache on die.
@chinesepopsongs00 3 years ago
This sounds like a more practical way of using space. You lose some cycles on the L2, but the gain is that the L3 is faster, and you save space, so a bigger L1 could be done also. A bigger L1 and a faster virtual L3 might cancel out the latency loss of the slightly slower L2. We will see if Intel and AMD start to adopt this new idea.
@RobBCactive 3 years ago
The video doesn't point it out, but the different cache levels use virtual or physical addresses. Code uses its virtual addresses to run at full tilt, but to share cached memory between processes, a virtual-to-physical address translation using the process memory mapping is required. This is the reason L1 cache can be very low latency, while L2 & L3 are slower. Arguably IBM appears to be using a distributed L3 cache, judging from the size. If you're writing high-performance code, you really want the main working set to fit in the core's L2, accessing L3 and RAM in predictable ways so the processor can anticipate and pre-fetch.
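One concrete way to see the virtual/physical split: L1 caches are commonly virtually indexed, physically tagged (VIPT), with the set-index bits chosen to fall entirely inside the page offset, so the index is identical in the virtual and physical address and set selection can start in parallel with the TLB lookup. The parameters below (4 KB pages, 64 B lines, 64 sets) are illustrative assumptions, not any particular chip's geometry.

```python
PAGE_BITS, LINE_BITS, SET_BITS = 12, 6, 6   # 4 KB pages, 64 B lines, 64 sets
assert LINE_BITS + SET_BITS <= PAGE_BITS    # index bits fit inside the page offset

def l1_set(addr: int) -> int:
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)

virtual  = 0x0000_7654_3ABC
physical = 0x0000_1111_1ABC                 # translation changed everything but the page offset
assert l1_set(virtual) == l1_set(physical)  # same set either way: no need to wait for the TLB
print(f"both map to set {l1_set(virtual)}")
```

Larger L2/L3 caches need more index bits than the page offset provides, so they must wait on translation, which is one reason they sit at higher latency.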
@AtariKafa 3 years ago
I am pretty sure IBM could save the GPU market, but they have more important things to do; they always solve very, very big problems.
@FutureChaosTV 3 years ago
IBM designs systems while AMD and Intel design products. That's how I would classify the difference.
@recai4326 3 years ago
Bro, what are you doing here? :D
@henryD9363 3 years ago
Apple's M1 seems to be efficiently addressing that arena. And the M1 is just the first version.
@robertutter1763 3 years ago
It's pretty awesome that you used an AMD K6-III+ in your video. It's the first CPU to use level 3 cache. Of course, it was off-chip level 3 cache, but it was huge compared to what was available in level 2 cache at the time.
@_DarkEmperor 3 years ago
Think of it a different way: if AMD can put a large L3 V-Cache on top of the processor die, it can redesign all on-die L3 cache as L2. You could have a huge L2 cache and a huge L3 cache.
@billkillernic 3 years ago
Just a technicality correction that didn't sit well with me (1:53): they don't start small because they are really fast, they start small because they are really, really close to the CPU cores and the circuitry needs to be really, really short in order to be that fast, which limits the available space that fits the two aforementioned criteria.
@justdoityourself7134 3 years ago
Per Jim Keller, ~"performance can be reduced to branch prediction and data locality" (and I paraphrase). The best, most concise rule of thumb for all hardware and all software design, IMO.
@hammerheadcorvette4 3 years ago
Memory is still the limiting factor in core performance. That has brought on this type of innovation and 3D stacking.
@justdoityourself7134 3 years ago
@@hammerheadcorvette4 That is what data locality means: how close the data is. E.g. cache, memory, disk, network...
@Game__Boy 3 years ago
Love the choice of a K6-III+ 400 MHz ATZ in your graphic at 1:24 -- I'm using one of those in my Super Socket 7 vintage gaming PC. What a neat processor.
@Veptis 3 years ago
Love the educational part. Really excited for the upcoming technology. But IBM processors really don't end up in consumer products, right? However, the technology might.
@henryD9363 3 years ago
The Apple M1 does not have an L3 cache. The CPU, GPU and all the RAM are on the same package, close together. No connectors, pins, sockets or copper traces on a motherboard. Very similar.
@jecelassumpcaojr890 3 years ago
The Tilera chips already used this idea - the private L1 caches could also be used as part of a chip-wide L2 cache. I myself have used this in some of my projects since 2001. Some of the ideas that make this work can be found in en.wikipedia.org/wiki/Directory-based_coherence
@VRforAll 3 years ago
That's 20 years ago, meaning key patents may have expired. If so, other companies could start working on their own implementations.
@Girvo747 3 years ago
"Possible" that IBM has patents out the wazoo? Hahaha, it's guaranteed that they do. When I partnered with them on a homomorphic encryption system I built, they tried to convince me to patent it, despite it being a pretty simple implementation of the Paillier algorithm lol
@robinbinder8658 3 years ago
casually flexing he partnered with IBM like bruh
@S41t4r4 3 years ago
They also flex about how many patents they applied for in a year and how many they have in total in presentations.
@robinbinder8658 3 years ago
@@S41t4r4 my achievements this year: - once ate a whole family-size pizza alone. Get on my level.
@S41t4r4 3 years ago
@Mahesh If you sold your product before the patent was admitted, the other party either can't patent it, or loses the patent pretty easily.
@Girvo747 3 years ago
@Mahesh Depends if the jurisdiction is first-to-file or first-to-invent.
@jeffwells641 3 years ago
A really nice analogy for caches is the Librarian. He keeps a handful of the most recently requested books on top of his desk, a stack of recently used books underneath his desk, a rack behind his desk with the most popular books of the last few days, and finally the main library where the books are stored more permanently. As the Librarian gets a book request, he puts the book on his desk and moves the oldest book under his desk, then moves the oldest book there back to the shelf behind him, and then puts the oldest book on the shelf back into its place in the library. Here it is easy to see that if the librarian keeps too many books on his desk, he's going to have difficulty getting his work done - the books will be in the way. If he's got too many books under his desk it will slow him down as well, but he's got more room to arrange this area. The shelf behind him has even more flexibility in its size and arrangement. That's a CPU and its L1, L2, and L3 caches, plus its main memory.
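The librarian's rule translates almost line-for-line into code. Below is a minimal simulation of that demotion chain with invented level sizes: each request lands on the desk, and the oldest item at any over-full level is bumped one level down.

```python
levels = {"desk": [], "under_desk": [], "shelf": []}   # the "library" (RAM) is unbounded
sizes  = {"desk": 3, "under_desk": 6, "shelf": 12}     # invented capacities

def request(book):
    for books in levels.values():
        if book in books:
            books.remove(book)                 # found at some level: pull it out
            break
    levels["desk"].insert(0, book)             # the freshest book goes on top of the desk
    names = list(levels)
    for i, name in enumerate(names):
        while len(levels[name]) > sizes[name]:
            evicted = levels[name].pop()       # the oldest book at an over-full level...
            if i + 1 < len(names):
                levels[names[i + 1]].insert(0, evicted)   # ...moves one level down
            # else: it simply goes back to the library (RAM)

for book in ["A", "B", "C", "D", "A", "E"]:
    request(book)
print(levels)   # {'desk': ['E', 'A', 'D'], 'under_desk': ['C', 'B'], 'shelf': []}
```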
@Bengt.Lueers 3 years ago
This is truly an amazing technology, and Ian's enthusiasm for it is palpable.
@erichrybinczak8894 3 years ago
I wonder, with newer packaging technologies, whether we are going to go back to Pentium Pro or Pentium II-style designs with off-die cache. With newer interposers, a larger L3 or L4 can be on-package, using other nm processes that can make cache faster or more power efficient.
@CjqNslXUcM 3 years ago
AMD's 3D V-Cache is apparently twice as dense as the on-die L3, because it uses a library optimized for cache, on the same 7 nm process.
@fvckyoutubescensorshipandt2718 3 years ago
They really need to hurry up and make dies cube-shaped with microholes in them for liquid cooling (millions or billions of layers instead of just 17). Cube vs square = both more cache and lower latency.
@notthedroidsyourelookingfo4026 2 years ago
2:11 I don't think that statement is correct. Isn't branch prediction more about pipeline utilisation than memory access? As in: the instructions for *after* a predicted jump get decoded *before* the jump condition is evaluated. Cache utilisation is more about data locality, as far as I understand it.
@almostmatt1tas 3 years ago
This really came at the right time; I've been learning about cache lately and I didn't realise how much I didn't know :) I really thought it was as simple as more cache = more better.
@lahmyaj 3 years ago
Not gonna pretend to know chip design too much, but from the stuff I watch, Apple's M1 is really good because it keeps its cores fed due to its design approach (the whole 8x instruction decoders, large caches etc.). Is this IBM approach not similar, in that it attempts to look at the problem differently than others and looks to utilise its resources better/more efficiently rather than just adding more resources?
@henryD9363 3 years ago
And... the M1 also does not have an L3 cache. Seems to work extremely well.
@DaxVJacobson 3 years ago
Makes me think of Intel's use of register renaming, a technique that abstracts logical registers from physical registers. Plus Intel loves cache on their chips; this already sounds like it's an Intel technology.
@RobBCactive 3 years ago
Everyone loves cache, because memory access is relatively much slower compared to CPU cycles. Intel has reduced cache sizes (P4) and loves market segmentation, with cache as a differentiator.
@WhenDoesTheVideoActuallyStart 3 years ago
I regret to inform you that no, caches are in fact not "an Intel technology"; they're quite... widely used, to say the least.
@DaxVJacobson 3 years ago
@@WhenDoesTheVideoActuallyStart There used to be a web site (behind a paywall, I forget its name) that would talk about problems with Intel's CPU manufacturing process (before the +++++ stuff), and that site would always say Intel always wanted to be a memory company, so their half-assed processors always had a ton of cache to paper over what a crap processor they made. This being a technology to add a CRAP ton of cache to a CPU thus sounds like an Intel technology to me: papering over the fact that Intel is a crap CPU company that is a monopoly, and the only things keeping it in business are monopoly power and cache memory to make it look like an innovator.
@WhenDoesTheVideoActuallyStart 3 years ago
@@DaxVJacobson Every modern processor has shittons of on-die cache, because they're all wide 5 GHz data-crunching machines, while the basic DRAM cell hasn't changed much for decades (besides the shrinking) and still has the same old access latency, which is monstrous when compared to modern on-die SRAM. Every year that passes, DRAM is more and more of a restriction on the performance of CPUs, and pre-fetching as much data as possible into faster SRAM caches is one of the only solutions, besides magically eliminating Amdahl's law or having STT-MRAM catch up with the 5nm node.
@hammerheadcorvette4 3 years ago
@@DaxVJacobson Companies have been trying to add a CRAP ton of cache to a core since the ARM days of the 80's. Intel sold many Haswell processors with 256MB of cache to the financial market.
@joxer 3 years ago
OK, as someone who worked on IBM database systems (Db2 for Linux and Windows for nearly 30 years, a couple more on SQL/DS): a mainframe isn't the only way that your banking transactions can be kept safe. This is software, specifically the write-ahead logging of your transactions, and then adding things like dual logging, log shipping for high-availability disaster recovery, failover for nodes, etc. We would commonly have 32 to 256 database partitions on 4 to 64 physical machines presenting to the end user as a single database, for instance. On with the show.
@rattlehead999 3 years ago
As per usual in the last 12 years, it's IBM doing the major innovations.
@katietree4949 3 years ago
Diggin' that Konami Code, brother. Also, I came here from a Gamers Nexus video, and I gotta say you've earned a new subscriber.
@Psychx_ 3 years ago
And here I was, thinking that future AMD designs might incorporate an L4 cache on the IO die. Now I am not so sure about that anymore, as there seem to be more options.
@TheStin777 3 years ago
That would seem like the logical next step after 3D stacking (as far as I understand it), but making a 'virtual' cache pool will be a LOT faster than L4, especially if it is done like this with L2: increasing size but drastically reducing latency.
@Psychx_ 3 years ago
@@TheStin777 It's all workload dependent though. The IBM Z series chips are tailored towards financial computing.
@IttyBittyVox 3 years ago
Virtual cache sounds interesting, but I think I'd need to see examples of how cache eviction works in practice for some realistic loading scenarios before I get excited, particularly for the workloads we think it'll help in. For example, does this design cope well with games that would benefit more from a larger dedicated shared cache? Or does the design mean that the virtual L3 and L4 get squeezed out to make room for L2 when that wouldn't be optimal for the load in question?
@Bengt.Lueers 3 years ago
Comment for the algorithm
@CaseAgainstFaith1 3 years ago
I had no idea that "mainframes" still exist. I thought modern banking systems simply used stacks and stacks of standard server chips. I had no idea there was a whole different processor design for modern mainframes.
@evenblackercrow4476 2 years ago
Thanks. Great explanation of cache and its relevance. I like your term 'workload profile' and its application to the flexibility of virtual cache.
@MichaelPohoreski 3 years ago
Whoa! This is the same Dr. Ian that got AMD to price the Threadripper 3990X at $3990. Subbed!
@robertalker652 3 years ago
You did a marvelous job of explaining and simplifying the subject matter, and I think many a layperson who has even a minimal grasp of system/processor architecture would not fail to grasp what you've presented.
@n.butyllithium5463 3 years ago
I like the toolbelt analogy for cache. The CPU registers are the tools in your hands; you work directly with them. L1 is your toolbelt: bigger, but slower. L2 is your toolbox: you need to restock your toolbelt from there, but it can hold even more tools. L3 is your truck: lots of storage, but you need to go all the way back to the truck. DRAM is the hardware store: almost all the tools available, but very far away; you lose hours just going over there and back. The hard drive is China: absolutely all tools, but it takes months to get the one you ordered.
@shavais33 3 years ago
Super cool idea on IBM's part. So good to see there are still active and commercially viable processor designers out there besides Intel and AMD. When I went through my Computer Engineering undergraduate work, I learned about the layering process used when manufacturing silicon chips. Expose the surface to a gas, let it dry and form a masking layer; shine a light through a film mask and through a lens onto the surface to burn a pattern through the masking layer; expose that to a doped gas to add doping to the exposed silicon; use an acid to remove the mask layer. Rinse and repeat with a different mask and maybe a different doping for a different purpose. At the time, they were doing 7 to 13 layers. I think often there is a grounding layer on the bottom and a power layer on top, with holes poked through the grounding layer for signal leads. I think adding more layers had diminishing utility because of crosstalk (flowing current induces a magnetic field around it, which can, in turn, induce current flow in nearby conductors, thereby interfering with any signals going through those nearby conductors) and heat dispersal (current flowing through a conductor that has some resistance produces heat and consumes power). It seems like both of those issues are a matter of materials research. If you can get very high resistance in a material achievable through a gas exposure process that conducts heat very well, you can disperse heat and prevent crosstalk at the same time. And if you can get very low resistance, you can get more current through with less voltage and not only generate less heat and save power, but also charge up capacitance faster and thereby increase signal speeds. If you could get a semiconducting material that was superconducting when/where it is in its conducting state and has very high resistance (enough to basically completely block current flow) in its non-conducting state, that would really win the game, because your power requirements would be incredibly low, your signal speeds incredibly high, and there would be no heat to disperse. I guess that would be the holy grail for this stuff. Anyway, the point is, it seems like a lot could be achieved by materials research to reduce heat and crosstalk, and then by adding a lot of layers. Imagine having a cube instead of a flat die. Right now dies are 100 mm² and about 0.5 mm thick. Some are 0.275 mm thick. If you had a die that was 100 mm cubed... that could potentially, theoretically, multiply the transistor count by up to 200 or so. Of course it would also take many times longer to produce a wafer, so I guess that's an enormous cost problem. Which I guess probably means that nobody is working on anything like that. But when our processor chips are super-semi-conducting cubes (or honeycomb shapes?) instead of flat chips, that's when AI robots will speak fluently, be indistinguishable from humans over audio-only communication channels, and be capable of taking over the work of world-class human doctors, scientists, artists and engineers and doing a better job of those things.
@AlexanderKalish 3 years ago
In my understanding, it would only work if records are evicted from the caches not only by being replaced with newer data but also by some sort of time-to-live eviction (and possibly other methods), because if you only have eviction by replacement with newer records, then all the L2 caches will be full after a short period of time and there will be no space in any of them to be allocated as virtual L3 cache. Also, this time-to-live and other kinds of eviction should not evict the line to some other core's cache but rather erase it, because otherwise you'll just have all cores ping-ponging the same old data.
@capability-snob 3 years ago
Lines are in a queue to be written back to RAM before they leave L1. You can safely discard anything in cache on most designs without impacting correctness. The hard part tends to be picking which lines to discard 🙃
@AlexanderKalish 3 years ago
@@capability-snob Well, the first thing that comes to mind is the time since the last use of the line.
@nghiaminh7704 2 years ago
At 13:08: the Intel Xeon is ~3 GHz; with 12-cycle latency, that's 12/3 = 4 ns. The z16 is 19/5.2 ≈ 3.65 ns. Then why is Intel's cache considered faster?
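Working that arithmetic through (latency in nanoseconds = cycles ÷ clock in GHz) supports the comment's suspicion: in wall-clock terms the z16 figure comes out slightly better, and "faster" claims usually come from comparing raw cycle counts rather than nanoseconds.

```python
# latency_ns = cycles / frequency_GHz, using the figures quoted in the comment above
xeon_l2 = 12 / 3.0    # ~4.00 ns
z16_l2  = 19 / 5.2    # ~3.65 ns
print(f"Xeon: {xeon_l2:.2f} ns, z16/Telum: {z16_l2:.2f} ns")
```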
@JoneKone 3 years ago
cache wars are in full swing.
@Friend_of_the_One-Eyed_Ladies 3 years ago
You explain things really well. Thumbs up.
@D.u.d.e.r 3 years ago
Cache is ultra important for future CPUs, and your highlighting of IBM eliminating L3 echoes what Apple did a long time ago with their Axx ARM-based CPUs, where they completely dumped L3 and enlarged L2. Having a virtual L3 cache made out of L2 cache is definitely the way to go, but we are still not where we should be with system memory, which should run as fast in clocks as the CPU itself. Future computers and CPUs need memory as fast as the CPU itself to achieve the greatest possible efficiency, and ideally that memory would also be non-volatile. Having RAM as fast as the fastest cache, with capacity and non-volatility, sounds like sci-fi, but it's what we need for the future. Thank you for highlighting this new design from IBM; it's really something special and proves again that IBM is a superb CPU design company!
@ragincaveman 3 years ago
Tightly coupled local data stores are one way of reducing memory pressure. They can be private, or coherent shared public, depending on the workload and application.
3 years ago
My take - virtual L3 and L4 are there just for compatibility reasons; most things will only use L1 and L2, or will have to go to main memory anyway.
@paulgray1318 3 years ago
IBM loves to sell chips with, say, 8 cores even if you get a 4-core system, as you can enable those other cores at a later time after paying IBM more money. Now the question is - will those unlicensed inactive cores also have their L2/virtual L3 cache disabled, or will the way IBM locks cores off still leave that L2 cache exposed and enabled for the virtual L3 cache? What I find interesting about IBM's design is the whole memory controller approach, which was in effect reprogrammable for future memory interfaces; that's pretty epic.
@selohcin 3 years ago
Incredible. We need this on consumer platforms RIGHT NOW.
@dasberserkr 3 years ago
This concept reminds me of a paper I wrote on virtual scratchpads back in 2012... At the time there was also the concept of virtual caches, which allows you to borrow space from on-chip memory across other cores. Very interesting to see IBM use something similar here... BTW, I like how the timelines show up as you speak; I will need to learn to do that for my videos :)
@ear4funk814 2 years ago
You may need to check with a patent lawyer ... this is obviously a "big bucks" feature, and if you had mentioned "virtual caches" back in 2012, then the idea is in the public domain such that others can freely use it ... or maybe you are entitled to royalties as well (I'm sure IBM has patents issued on this).
@petergerdes1094 a year ago
@@ear4funk814 Why would they? They obviously don't have the patent, and the only question is whether AMD or Apple etc. gets sued for patent infringement, at which point they'll do a paper search for prior art.
@YouHaventSeenMeRight 3 years ago
I think you forgot about registers in your cache level story. Most modern GPUs have a large register section for immediate access to data. L1 cache typically has 3-5 clock cycles of delay; registers are immediately available. Typically the register section of a GPU is much larger than that of a CPU, so registers come into play more for GPUs than they do for CPUs.
@Mammothtruk 3 years ago
Moving a line from cache to virtual L3 means throwing it from one memory controller to another, and a cycle is taken to do so, which means two memory controllers would have to wait a cycle to do anything else. I can see the benefit, but also possible issues: crosstalk might stall other things until it is done, and there could be attacks that force a leak of memory from one core to another so it can then be read? It will be interesting to see where they all go with all the new cache things coming around.
@jpdj2715 3 years ago
You omit one thing: the type of memory cells used for these caches. In 1995, IBM had a UNIX (AIX) server that used static RAM for cache, and it could be expanded. In line with your point, throwing more cache into a base model made it faster than the next model up with a more powerful CPU. The nature of DRAM is that it is unstable and slow. That static RAM had less than 1/8th the latency of DRAM. And it was very expensive - probably because of very low production volumes. I always wondered why the DRAM in my PC had not been replaced by static RAM in the past decades. Oh, mainframes - next to billions of smartphones and PCs, and a couple of Macs and iPhones, there were about 10,000 IBM mainframes in use in 2019.
@nedegt1877 3 years ago
I think 3D cache or stacked cache is going to be the future, for a couple of reasons. 1. Space. Like you said in the video, stacked cache allows bigger and faster cache; 4 GB is doable. 2. Price. The logic behind virtual cache systems is more expensive than 3D cache. 3. Applications. There are tons of ways you can create very fast chips using 3D design. Imagine a 3D (stacked) chip that's 3 layers: the 1st layer for CPUs, the 2nd for massive cache, the 3rd for GPUs. That's a very fast and crazy chip, having both CPU and GPU using the L3 cache in the middle. That's some pretty awesome power you get in one chip package for the majority of workloads. But the best idea would be a hybrid 3D virtual cache. Eventually you'll see CPUs getting GBs of very fast memory, be it L2, 3 or 4.
@unvergebeneid 3 years ago
Why does "searching through" a larger memory take more time? What's the mechanism there? Is the delay in the LUT or in the actual memory addressing?
@BRNKoINSANITY 3 years ago
Picture it like an index. It is hunting through the terms/references, and a larger book will have a longer index. It's got to find the location of the data in that index before it can retrieve it.
@capability-snob 3 years ago
The signal takes time to travel through a physically larger cache, and then to travel back. We're at the sort of clock speeds now where, even at the speed of light, an unimpeded 5 GHz signal could only cover 6 cm in a cycle. Add to that, driving buses with many outputs gets inefficient quickly and drives down the slew rate on the signal, so there are a few levels of muxing between the query and the actual lines. I bet finding the balance with each new process is quite the challenge.
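That 6 cm figure checks out as simple arithmetic, and it's an upper bound, since on-die signals propagate at a fraction of c and pass through mux and driver stages along the way:

```python
c = 3.0e8                                # speed of light in m/s; on-die signals are slower
cycle_time = 1 / 5.0e9                   # one cycle at 5 GHz
print(f"{c * cycle_time * 100:.0f} cm")  # -> 6 cm per cycle, at best
```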
@redtails 3 years ago
I'm not getting a clear answer on whether the shared L2 is faster compared to L3. If the shared L2 is on another chip, it might well be slower than L3 is, or maybe even comparable to RAM. It would be of interest to just have a table of scenario versus latency.
@martylawson1638 3 years ago
AFAIK cache uses a fixed mapping between main memory addresses and cache addresses to maximize search speed. This means that useful data is never packed densely, and there are lots of gaps of vacant cache memory all the time. Sounds like an excellent resource to turn into a virtual L3/L4.
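The fixed mapping is easy to demonstrate: in a direct-mapped or set-associative cache, the set is just a fixed slice of the address bits, so unluckily strided data piles into a few sets while the rest sit vacant. The parameters below (64 B lines, 1024 sets, a 256 KB stride) are invented for illustration.

```python
LINE_BITS, NUM_SETS = 6, 1024          # 64 B lines, 1024 sets (invented)

def cache_set(addr: int) -> int:
    return (addr >> LINE_BITS) & (NUM_SETS - 1)   # the set is a fixed slice of address bits

# Walking memory with a 256 KB stride (e.g. one column of a big matrix):
addrs = [0x1000_0000 + i * 256 * 1024 for i in range(64)]
used = {cache_set(a) for a in addrs}
print(f"{len(used)} distinct set(s) used out of {NUM_SETS}")   # -> 1: everything collides
```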
@neutechevo 3 years ago
There seems to be a right time for each approach and design to work. IBM seems (always) to be right at the tip of the spear that is innovation. At earlier times, such large pools without some highly intelligent mechanism couldn't make sense. If I am not mistaken, IBM also had a connection with GlobalFoundries and the first Zen iterations; the whole CCX/core/cache structure, I think, was influenced by the trends of the µ-arch at the time, and each time IBM seems to be there. Also, it seems that Intel is starting to make larger L2 structures with the new cores... maybe going towards this approach? I hope your channel keeps growing; it gives me good vibes and has a feline feeling to it.
@tomcarlson3913 3 years ago
One thing to think about is that the virtual caches will never be as big in practice as they are in theory. L2 in use subtracts from available L3, and both L2 and L3 subtract from L4. On multi-core loads with heavy L2 usage, L3 and L4 could effectively become tiny. Depending on the cache management software, the model where cache size grows with cache level could be turned on its head and cause problems. The cache management software's design and performance will be a key player in making or breaking this technology in the field.
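Putting rough numbers on that shrinkage (core count and L2 size Telum-like, the occupancy figures invented):

```python
L2_MB, CORES = 32, 8                     # Telum-like: eight cores, 32 MB of L2 each
idle_ish = [0.9, 0.8, 0.2, 0.1, 0.1, 0.0, 0.0, 0.0]  # invented per-core L2 occupancy
busy     = [0.95] * CORES                             # every core hammering its own L2

def virtual_l3_mb(occupancy):
    return sum((1 - o) * L2_MB for o in occupancy)    # virtual L3 = everyone's unused L2

print(f"lightly loaded: {virtual_l3_mb(idle_ish):.0f} MB of virtual L3")
print(f"fully loaded:   {virtual_l3_mb(busy):.0f} MB of virtual L3")
```

Under these assumptions the virtual L3 swings from roughly 189 MB down to about 13 MB, which is exactly the "tiny L3 under heavy multi-core load" scenario the comment warns about.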
@justdoityourself7134 3 years ago
This is super context sensitive. For example, on normal desktop workloads, more L2 would be great... but mainly if there were fewer cores, because there would be more task swaps. But with higher core counts, the scheduler remembers process or thread affinity for this very reason: to keep the specific core's cache hot for that task. This is so complicated. I can't really imagine it being a guaranteed improvement. Very interesting; lots of power vs performance vs workload vs software optimization type considerations.
@blazbohinc4964 3 years ago
Hey Ian, what do you think about PrimoCache?
@uncurlhalo 3 years ago
I work in a DC for a bank, and they just installed a load of the newest model of Z mainframes. These things are sick, and some of them even have liquid cooling built in. Amazing machines.
@123argonaut 3 years ago
"Virtual cache"... I'll be darned. I knew nothing about computers, but now I know about "virtual cache"... cutting edge stuff.
@SalvatorePellitteri 3 years ago
Do you think HBM could be used as a massive, tens-of-GB L4?
@law-abiding-criminal 3 years ago
Are the L1, L2 and L3 caches searched simultaneously or one after the other? The one solution is faster while the other is more energy saving.
@henryD9363 3 years ago
Good question. I wish I knew the answer. I assume that L2 is searched only if there's a miss on L1; similarly, L3 is searched only on an L2 miss. In a different YouTube video someone said that IBM systems don't care at all about energy. So I think that would hold here.
@kazedcat 3 years ago
The search is not very time intensive. The majority of the cache latency is caused by wire impedance, so the amount of time you save by simultaneous search is very small, and you are stuck waiting for the signal to traverse the wires.
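A toy expected-value model shows the shape of the trade-off described in the two replies above; every latency, energy figure and hit probability in it is invented for illustration:

```python
levels = [("L1", 4, 1.0, 0.90),   # (name, latency in cycles, energy units, P(hit here))
          ("L2", 14, 5.0, 0.08),
          ("L3", 40, 20.0, 0.02)]

serial_lat = serial_en = par_lat = 0.0
cum_lat = cum_en = 0.0
for name, lat, en, p in levels:
    cum_lat += lat                # serial: level i is probed only after i-1 missed
    cum_en += en                  # serial: energy is spent only on probed levels
    serial_lat += p * cum_lat
    serial_en += p * cum_en
    par_lat += p * lat            # parallel: the hitting level answers in its own latency
par_en = sum(en for _, _, en, _ in levels)  # parallel: every level burns energy, every time

print(f"serial:   {serial_lat:.2f} cycles, {serial_en:.2f} energy")
print(f"parallel: {par_lat:.2f} cycles, {par_en:.2f} energy")
```

With these numbers the parallel probe saves under a cycle of average latency while spending more than ten times the energy, which is why probing levels one after the other is the common choice.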
@emscape2 3 years ago
I knew someone would come up with this at some point. Props to them.
@D.u.d.e.r 2 years ago
This is an excellent report and overall analysis, Ian, well done!👍 IBM is really creating something special with Telum, and its quite different approach, with the virtual, shared L2 cache, reminds me of how Apple ditched the slower L3 cache and expanded the L2 cache in their newer Axx chips. Still, IBM is coming up with something special again, which might inspire both team blue and team red, as well as Arm. Regarding cache itself, it would be awesome one day to have GBs of high-speed cache on the SOC package, which would completely replace DDR and even HBM. It's quite clear that the future of memory is to get it running much faster, so that RAM runs almost as fast as cache and non-volatile memory like flash runs as fast as RAM. The ideal would be one single, unified memory which runs as fast as the fastest cache but is non-volatile like flash 😁
@joechang8696 3 years ago
In the multi-core era, Intel processor caches were inclusive, so whatever is in L2 is also in L3. On an L2 miss, it is only necessary to look in the shared L3, without worrying about what's in other cores' (private) L2s. With Skylake (SP), L2/L3 became exclusive (AMD did this earlier?). On an initial load, memory goes straight to L2; on eviction from L2, it goes to L3. To implement this, each core's L2 tags are visible on the ring. This being the case, L3 is no longer strictly necessary - however, it may be desirable to keep L2 access times reasonably short.
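The inclusive/exclusive distinction comes down to two different fill-and-evict flows. Here it is as a minimal sketch, with Python sets standing in for the caches and no real sizing or associativity modelled:

```python
l2, l3 = set(), set()   # sets stand in for the caches; only the fill/evict flow is modelled

def fill_inclusive(line):
    l2.add(line); l3.add(line)         # inclusive: whatever is in L2 is also in L3

def evict_l2_inclusive(line):
    l2.discard(line)                   # the L3 copy stays behind

def fill_exclusive(line):
    l2.add(line)                       # exclusive: memory goes straight to L2, skipping L3

def evict_l2_exclusive(line):
    l2.discard(line); l3.add(line)     # L3 only ever holds L2's castoffs (a victim cache)

fill_exclusive("A"); evict_l2_exclusive("A")
print("A" in l3, "A" in l2)            # True False: capacities add up, nothing is duplicated
```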
@Sunny-gt8zi 3 years ago
Thank you so much for going through these more deliciously geeky topics that others are not informing us about on YouTube! I really appreciate it :D
@woswasdenni1914 3 years ago
Where are the FPS comparison charts for that mainframe? Asking for a friend.
@Angmar3 3 years ago
Nice video, a great explanation of how cache works.
@deez6005 3 years ago
Love your in-depth content. Love the channel. Keep up the quality work.
@joe7272 3 years ago
That's a very good strategy for "big data". Memory prediction accuracy is likely much higher too, meaning it takes full advantage of the size.
@congchuatocmay4837 a year ago
Caching is very important if you want to do large fast Walsh-Hadamard transforms, which you might for SwitchNet.
@KaosArbitrium 3 years ago
This has a lot of potential. I'd be a bit worried about the tracking accuracy and the security, but so long as they've kept such things in mind, this should work out amazingly.