Cash is super important to performance as you can't buy anything without it. Cache is also important. You need cash to buy more cache :P
@AndreVanKammen 3 years ago
My fortune is mostly write-through cash, so not enough money to buy cache.
@TechTechPotato 3 years ago
It comes in from the day job, falls through your pocket, right into your bills.
@MekazaBitrusty 3 years ago
I believe that cache is pronounced caish, not cash.
@alfredzanini 3 years ago
@@MekazaBitrusty I've heard it pronounced as Keishe as well.
@MekazaBitrusty 3 years ago
@@alfredzanini Well, Google just slapped me a new one. Apparently it is pronounced cash! English is so screwed up 🙄
@jimiscott 3 years ago
As you mentioned at the outset, the caching needs to be optimised for the workload. IBM's Z-Series is highly targeted to its workload. The shared caching works when you've got a well-known (homogeneous) set of services. A (large) x86 server could be hosting VMs all doing different things, and a workstation/PC a game, browser tabs, a video editor, a dev environment/tools and a whole raft of utilities and services... each of these actively fights for cache, invalidating older cache lines. The more discrete services running, the easier it is for cache lines to be invalidated... with enough discrete active services, at worst you almost have a FIFO stream of once-used data. Dedicated core caches at least provide some barrier against other cores effectively starving the owner core of its own cache (though I am sure IBM has thought of this), therefore allowing threads (which are more inclined to always run on the same core) to run without interruption from other cores/threads. Since a Z-Series machine will have a number of known discrete services running on it at any one time, I would guess the OS will either auto-tune the caching strategy for each of these services or allow it to be defined, like it does the allocation of cores. x86 doesn't really have this luxury due to its broad, multi-purpose nature.
@TheStin777 3 years ago
Thanks for the explanation 👍. Perhaps in a few years this type of caching will come to x86. (I think) if it goes to ARM first, x86 will be on death's door if it doesn't come up with its own solution fast! (As far as I understand the tech.)
@orkhepaj 3 years ago
so how is this good for gaming?
@TheStin777 3 years ago
@@orkhepaj kzbin.info/www/bejne/iIKZnGCZhp5niMU
@alexanderzero5752 3 years ago
@@orkhepaj Video games are not usually optimized for multicore performance. So if someone with a PC was playing a game but not really using other programs at the same time, the L2 caches of the other cores likely wouldn't be used very much, meaning there wouldn't be a lot of cache misses on the virtual L3 cache; in addition, the core running the game now has a much larger L2 cache to itself. Things might get more complicated for streamers, though, who do have some extra compute going on to process and encode their video/audio inputs.
@egg5474 3 years ago
Isolated cache is also more secure, as we saw with the release of Zen 3.
@danieljones6778 3 years ago
Can't wait for the new round of Spectre exploits based around this.
@CulturedThugPoster 3 years ago
Exactly what I was thinking.
@burdebc1 3 years ago
I don't want to see it happen, but it will.
3 years ago
exploits on Z? this isn't Intel
@CulturedThugPoster 3 years ago
@ Yet the implications the good Potato points out are that Intel and AMD may well exploit this IP in their own architectures, which implies the same problems with predictive branching that are at the heart of the Spectre exploits.
@sugo8479 3 years ago
If they do a bad enough job we might even see something that can be replicated outside of a lab.
@bernardfinucane2061 3 years ago
It seems like this would only make sense if the cores are out of sync. By that I mean that if they all have similar loads, their L2s would tend to fill in parallel, reducing the virtual L3 available.
@kellymoses8566 3 years ago
The kind of software that IBM mainframes run really doesn't scale to many cores, so I think that is a safe assumption. These systems are usually running many heterogeneous workloads.
@MaxIronsThird 3 years ago
Yeah, for games it might be better to stick with a big L3 cache, because there would be a lot of redundant shit in L2, given that games are coded to be really good at multithreading.
@Freshbott2 3 years ago
@@MaxIronsThird Maybe not though, because (not sure if he touched on this, but I'm not going to rewatch) it would depend on where you're processing it. The time it takes to kick something out of the core's cache wouldn't really matter while you're not processing it, but if you can just process it in its new adjacent core when you need it, that would be faster than sending it back to the original core, and faster than jumping up and down layers of fixed cache. AMD has shown a little bit of extra latency seems worth it to gain a lot of shared cache, so this system should have the best of both worlds.
@JJFX- 3 years ago
@@Freshbott2 So in theory what gets kicked out to virtual L3 essentially becomes L2 again if the adjacent core is able to process it. I wonder if it would also be possible for something like Intel's big.LITTLE-style hybrid concept to sandwich their smaller cores between each 'pair' of L2...
@TheGraemeEvans 3 years ago
Yeah, it depends on some cores being idle for their cache to be available for others to borrow.
@solidreactor 3 years ago
I loved the "Cache me outside, how about that" :D Imagine a Zen with each core having a 12.5 MB+ L2 cache, now that's a gaming CPU for us "cacheual gamers" ;)
@vyor8837 3 years ago
Or 32 MB of L3 and 4-8 MB of L2.
@paulgray1318 3 years ago
Cache lines matter ;).
@white_mage 3 years ago
@@vyor8837 The sole idea of having an L1 cache is to make things a lot faster. Then there's the L2 cache as a backup for the tiny L1 cache, and then there's the L3 cache, which is huge but slow. I didn't even think an L3 cache could be important to have, because I thought it was super slow when I realized it existed. I still think a massive L2 is better than having L3, but if a massive L3 is what I have, I won't complain; it's still better than just having L2. But again, I'd rather have a massive L2: I think it's faster than going to check L3 or L4.
@vyor8837 3 years ago
@@white_mage 12.5 MB is smaller than what we have now for cache. Amount beats speed; power draw matters more than both. L2 is really power hungry and really large for its storage amount.
@white_mage 3 years ago
@@vyor8837 You're right, I didn't realize it was that big. If most of the IBM chip shown in the video is cache, then the L2 has to be huge. What about L3? How does it compare to L2?
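A quick way to reason about the big-L2 vs L2+L3 question in this thread is average memory access time (AMAT): each level's latency weighted by the chance an access gets that far. The sketch below is a back-of-envelope model; every latency and hit rate in it is an invented illustration, not a measured number for any real chip.

```python
def amat(levels, dram_latency):
    """levels: list of (hit_latency_cycles, hit_rate) from L1 outward."""
    total, reach = 0.0, 1.0            # reach = fraction of accesses that get this far
    for latency, hit_rate in levels:
        total += reach * latency       # every access reaching a level pays its latency
        reach *= 1.0 - hit_rate        # only the misses continue outward
    return total + reach * dram_latency

big_l2 = [(4, 0.90), (14, 0.98)]              # design A: big L2, no L3
l2_l3  = [(4, 0.90), (12, 0.90), (40, 0.90)]  # design B: smaller L2 plus a slower L3
print(f"big L2 only: {amat(big_l2, 300):.1f} cycles on average")
print(f"L2 + L3:     {amat(l2_l3, 300):.1f} cycles on average")
```

With these made-up numbers the two designs land within half a cycle of each other, which is the point: the winner depends entirely on the hit rates your workload actually achieves at each level.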
@Big_Tex 3 years ago
You’d think you’d want a huge amount of L1 but there’s a cache-22 …
@user-ty2uz4gb7v 3 years ago
🤦♂️
@EShirako 3 years ago
...that was an awful, naughty pun and you should feel bad. Stop that! STOP GIGGLING, DARN IT! o.@; *IsKidding, that was terrible-in-good-punny-ways pun, which means it was awful but a pun, so a good-awful! Does bet such a naughty punster is giggling, though!*
@pirojfmifhghek566 3 years ago
This joke has serious cachet.
@porina_pew 3 years ago
Two random thoughts: The virtual L4 to me implies that socket-to-socket access is still faster than socket-to-RAM. Is that right? I guess that also involves NUMA. How does L4 compare to local RAM? Recently we've seen Intel go relatively bigger on L2/core while not so much on L3, and AMD going bigger on L3/core with relatively small L2. Wonder what the thinking is behind that design choice.
@SirReptitious 3 years ago
From what I remember, the reason AMD had the big L3 cache when they made Zen was so that no matter how many cores were being used, ALL of the L3 would be utilized to maximize performance.
@shadow7037932 3 years ago
Given their Z platform, where they need to shuttle data between CPUs all the time, it wouldn't surprise me if CPU-to-CPU is faster than CPU -> DRAM.
@winebartender6653 3 years ago
A lot would depend on the TLB structure and architecture, as well as what your latency penalties are for going through the cache levels. He went into a lot of the decisions that can be made about caches and their sizes. It's clear that AMD made an early decision in Ryzen's life cycle that L3 would be their low-latency bulk storage compared to DRAM. Considering the design inherently increases DRAM access time (relying on a GMI link between cores and IO dies), they needed a way to make up that performance loss. Also, with their announcement of V-Cache/stacked silicon, they have doubled down on this concept. My guess is that going to the chiplet design helped force their hand in this direction due to size constraints, where it makes more sense to reduce the cache bottleneck at the die level as opposed to the core level. Also, you have to be very confident of core utilization to sacrifice die area for L2 in place of L3, as that space is wasted if the core isn't being used. This is where IBM's design is very interesting, but it would be interesting to see the latency of the virtual L3 in the worst case (furthest cores apart). Maybe Zen 4 or 5 brings stacking from the get-go, allowing the base chiplet to have larger core-specific cache (L1 and L2) and the upper die to be used like V-Cache for L3.
@kyoudaiken 3 years ago
Yes, socket-to-socket links are proprietary and very fast. DDR4 and DDR5 are currently the biggest bottlenecks in any system.
@RobBCactive 3 years ago
@@winebartender6653 Also, fetching data from DRAM costs more power; they were explicit about this when explaining the Infinity Cache for RDNA2. Not only a narrower bus to save die space, but the cache improved power efficiency.
@OrangeEpsilon 3 years ago
We all probably want to see lots of graphs and diagrams that show all the performance characteristics :D The idea of using virtual L3/L4 caches inside massive L2 caches is really brilliant. I would really like to explore this in my future master's thesis, which I will start in about two years.
@RobBCactive 3 years ago
Is it so brilliant for the coder though? This will look like a massive distributed L3 with less predictable access times, rather than a typical L2, because the L2 is now inherently shared.
@kiseitai2 3 years ago
@@RobBCactive Perhaps, but oftentimes you end up working on legacy projects in which the stakeholder wants some refactoring done fast (often out of necessity). At that point, you don't always have the bandwidth to tailor the code to the platform. Also, often, the code may be in an interpreted language (either moving from or to one). That's code you can't hand-tailor for a CPU as easily as C or assembly. My opinion is that a lot more code will passively benefit from this cache redesign than not.
@RobBCactive 3 years ago
@@kiseitai2 But it's inherently giving less control; the kind of code you're talking about relies on L3 anyway, and it's not the kind of high-performance code that tunes to L1 & L2. Besides, you can query the cache size and size the chunks you do heavy processing on accordingly. It's not as hard as you make out to dynamically adjust. I just think people need to recognise that there's potentially more cache contention, and it may not be reasonable to process data in half-L2 chunks. I don't really see only advantages to a very complicated virtual cache system over L2 per core backed by a large L3, which follows data locality principles. There have been enough problems with SMT threads being able to violate access control already.
@OrangeEpsilon 3 years ago
@@RobBCactive I think it is true that the access times will become less predictable because the size and partition of the L3 cache is dynamic. But what if we give the software some control over how to partition the virtual L3 cache? It could make the access times more predictable, but the OS could also move threads faster from one core to another. Also, a core could be put to sleep and then the associated cache can be configured as a non-virtual L3 cache. You see, there are lots of things that make this interesting and could bring some advantages, so that the disadvantage of unpredictable access times can be minuscule.
@abdulshabazz8597 3 years ago
@Fabjan Sukalia : Computational RAM is the future. Compute-in-memory architectures will compete against and overtake multicore designs, even those having gigabyte-sized caches.
@kungfujesus06 3 years ago
I do wonder if this, ever so slightly, increases the attack surface by basically making every private cache line globally accessible. I wonder if there might be even better side-channel attacks enabled by this.
@dfjab 3 years ago
The sad reality is that the faster you make the CPU cores, the more such attacks will be possible. The industry might not be ready to admit it yet, but speed is most likely not compatible with security.
@kazedcat 3 years ago
IBM could design it so that the private cache is encrypted and the memory address is hashed and salted. That way, even if a side-channel attack extracts a private cache line, attackers would still need the keys to unlock the data, and those should only ever be kept in the core.
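As a toy illustration of the hashed-and-salted half of that idea, a cache could derive its set index from a keyed hash of the line address, with a per-thread key that never leaves the core, in the spirit of published randomized-cache proposals rather than anything IBM has disclosed. The set count, line size and the choice of BLAKE2 as the keyed hash below are all assumptions for demonstration.

```python
import hashlib, os

NUM_SETS = 256                 # hypothetical set count
LINE_BITS = 8                  # assume 256 B cache lines

def set_index(addr: int, thread_key: bytes) -> int:
    line = addr >> LINE_BITS   # drop the offset-within-line bits
    digest = hashlib.blake2b(line.to_bytes(8, "little"),
                             key=thread_key, digest_size=2).digest()
    return int.from_bytes(digest, "little") % NUM_SETS

key_a, key_b = os.urandom(16), os.urandom(16)  # per-thread keys, held inside the core
addr = 0x7FFF_DEAD_BEEF
print(set_index(addr, key_a), set_index(addr, key_b))  # same address, (almost surely) different sets
```

Because one thread's placement looks random to every other thread, classic prime-and-probe style set profiling gets much harder, which is the property the comment is after.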
@kungfujesus06 3 years ago
@@kazedcat I feel like the secure enclave concept tends to eventually be defeated. At least that was the case for Intel.
@kazedcat 3 years ago
@@kungfujesus06 The goal is not to make a bulletproof system but to make the exploits extremely hard to implement. If each logical thread has its own unique keys and salts, then exploits need to string together multiple vulnerabilities instead of needing only one. Also, extracting keys from the core will be harder than extracting data from a shared cache. The problem with a secure enclave is that it outsources security to a separate dedicated core instead of each core handling its own security. This reduces the attack surface but also makes security a lot more fragile: you only need to crack that one security core and everything becomes vulnerable.
@ear4funk814 2 years ago
@@kazedcat Or you can design and patent it if they haven't done so ... seems you have a solution to a real problem.
@CharcharoExplorer 3 years ago
Super fast L1 with high associativity. Massive L2 cache that can work as L3 at times, the way IBM designed it. V-cached L2 or L3, and a chiplet with a decent, massive L4. That would be godlike.
@erichrybinczak8894 3 years ago
Agreed. With the new packaging tech from Intel and AMD, the next few CPUs could be some interesting designs, testing out new techniques until they find the right optimizations and features that make CPUs 2x the current performance.
@crustybread9286 3 years ago
Virtual L3/L4, amazing. Wonder how coherence is communicated across cores and across chips, and how virtualization of apps affects the flushing/invalidation of these virtual caches. Amazing engineering!
@julkiewitz 3 years ago
L2 / virtual L3 sounds like something perhaps universally applicable. L4 sounds like something only useful in the context of the Z architecture. Consider that all those reliability guarantees must add quite a bit of overhead on the Z system. Not sure this will still apply to something like a dual-socket x86. Still, the tech behind it must be absolutely impressive.
@nonae9419 3 years ago
1: I would rather describe the cache layout the other way round. The L1 cache is the one you work with, and anything you don't need is stored in RAM. This way the importance of cache is more pronounced, instead of it sounding like just another 2% performance improvement. 2: Branch prediction is used for optimizing small loops (size dependent on µ-op cache size) and decreasing pipeline stalls. What you meant was cache prefetching.
@deilusi 3 years ago
I hope that marking L3/L4 works; if not, this seems like Meltdown madness. Even if that virtual L3/L4 won't be migrated to our CPUs for security reasons, a massive L2 would be interesting to see.
@petergerdes1094 a year ago
This sounds great for mainframe-style applications but sounds like a nightmare for any use case where you need to worry about side-channel attacks from other processes.
@jacobpalecek7439 3 years ago
Thank you for your amazing content, Ian. Even as a computer scientist (albeit focused on software only), I still learn new things from most of your videos!
@Snotkoglen 3 years ago
That Linode ad with the buff American accent cracks me up every time. :)
@FreeloaderUK 3 years ago
Ah, the American accent which slips into an English accent every few words.
@hammerheadcorvette4 3 years ago
😆😆😆
@Grass89Eater 3 years ago
I think it might work very well when not all 8 cores need to use their entire L2 cache at the same time. Games usually have very uneven workloads between different threads, so it will probably work great for games.
@benjaminoechsli1941 3 years ago
16:21 The good doctor is at a loss for words at how huge and cool that is. Love it.
@deechvogt1589 3 years ago
As always, Ian, very fascinating. Keep the awesome content rolling.
@Supermangoaliedude 3 years ago
I worked on the memory coherency unit for these core+cache units. It was fascinating to learn about, and I'm glad I could be a part of implementing it.
@HexForger 3 years ago
12:32 Well, simply put, this complements the mainframe setup perfectly. I mean, even older MVS concepts like shared/common virtual memory, or the coupling facility, all enabled Z to perform like no other infrastructure while still maintaining a high degree of parallelism and ridiculously low IO delay. Ironically, despite the "legacy" label often attached to mainframes by the press, Z still remains one of the most sophisticated and advanced pieces of technology, improved by each iteration to this day.
@abzzeus 3 years ago
I'm old enough to remember when ARM was used in RISC PCs; they went from an ARM610 at 40 MHz to a StrongARM at 203 MHz with 16 KB of cache, and the speed-up was way, way more than what the raw MHz would indicate. The BASIC interpreter now fitted in cache. Programs ran so fast it was insane. Some games had to be rewritten/patched as they were unplayable. All down to the cache.
@estebanguerrero682 3 years ago
Thanks for the explanation; it is amazing to see these new approaches to latency and allocation :O
@Winnetou17 3 years ago
This is nice. This allows both L2 and virtual "L3" and "L4" in bigger amounts than before, but it also gives a single-threaded workload the opportunity to have a very nice, very big 32 MB of L2.5 cache. And still more L3 and L4, if the other cores are barely doing stuff. A huge increase either way.
@TreborSelt 3 years ago
I'm so glad I found your channel. It's the only way I can get info on the deep dives of the tech industry. 😎
@Zonker66 3 years ago
That was very comprehensive. Never understood cache that well. Thank you.
@Netsuko 3 years ago
I just wanted to say that your videos are fantastic! Really well explained and educational to watch. I never understood what L1, 2 or 3 cache ACTUALLY did. Until now.
@superneenjaa718 3 years ago
That's genius, but the potential problem I see here is: how are they pushing out old data? We wouldn't want cases where a core doesn't have enough space in its private cache because it's full: it goes to store elsewhere, but its private cache was filled with old data from other cores that won't be used much again. A good amount of planning needs to go into that (which I suppose IBM's engineers have already done). However, I wonder if the virtual L4 helps at all. Shouldn't it be as slow as memory, since it's outside the chip anyway?
@AdamBrusselback 3 years ago
There are probably quite fast interconnects between the chips in the system, likely still lower latency than RAM, otherwise they wouldn't bother.
@orkhepaj 3 years ago
Probably they can throw out data from other cores when they need the space.
@norili7544 3 years ago
If a cache is full, the oldest data gets evicted. My understanding is that the caches will prioritize in-core usage as L2 over out-of-core usage as L3.
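That priority can be made concrete with a toy replacement policy: plain least-recently-used eviction, except that borrowed "virtual L3" lines are sacrificed before the owning core's own L2 lines. This is a sketch of the concept only; the real replacement policy is more sophisticated and not public.

```python
from collections import OrderedDict

class SharedL2:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()        # addr -> is_local (True = owning core's L2 line)

    def touch(self, addr, is_local):
        self.lines.pop(addr, None)
        self.lines[addr] = is_local       # most-recently-used lines sit at the end
        if len(self.lines) > self.capacity:
            self._evict()

    def _evict(self):
        # Prefer the least-recently-used borrowed (virtual L3) line...
        victim = next((a for a, local in self.lines.items() if not local), None)
        if victim is not None:
            del self.lines[victim]
        else:
            self.lines.popitem(last=False)  # ...else evict the oldest local line.

cache = SharedL2(capacity=4)
for addr in (0x00, 0x40, 0x80):
    cache.touch(addr, is_local=True)      # the owning core fills its own L2
cache.touch(0x1000, is_local=False)       # a neighbour parks a "virtual L3" line
cache.touch(0x2000, is_local=False)       # over capacity: 0x1000 goes, locals survive
print(list(cache.lines.items()))
```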
@countach27 3 years ago
This really is impressive; unless all cores are heavily utilized, you have an almost infinite amount of cache. Crazy, crazy.
@TheMakiran 3 years ago
Considering that consistent 100% utilization is almost impossible, this is very cool.
@fteoOpty64 3 years ago
@@TheMakiran Ian can tell you his Pi machine will certainly run very close to full bore on all cores!
@Innosos 3 years ago
I'm impressed that you seemingly have the money to buy a multi-socket system. Us mere mortals might as well wait another decade before these systems become affordable, at which point, I presume, the requirements for common applications will have appropriately expanded to fill that "infinite amount of cache" rather quickly.
@humpadumpaa 3 years ago
@@Innosos I'd say the equivalent of sharing the L4 cache between processors would be to share it between chiplets/tiles instead.
@hammerheadcorvette4 3 years ago
At 5.2 GHz, they are hoping to "clear" the L2 as fast as possible. That is reasonable, as most of these machines are going into the finance sector.
@jk-mm5to 3 years ago
The L3 cache on my old 7900X has the same transfer rate as my DDR4 main memory. It seems that traditional L3 cache is obsolete with today's faster RAM.
@emperorSbraz 3 years ago
L2 used to reside on the motherboard. I had one with a cache SLOT that fitted an extra 256 KB on top of the onboard 256 KB. :)
@SirReptitious 3 years ago
Hmm, I don't remember ever seeing a motherboard with that feature! All the ones I recall using had two 256k SRAM chips soldered to the motherboard. And for some reason it was common enough for the L2 cache to go bad that the BIOS had the ability to bypass the L2 cache when that happened, albeit with greatly reduced performance. And of course you will remember that back then the memory controller wasn't in the CPU but in the Northbridge. So we had a situation that today sounds insane: when the CPU needed data, it would check its L1 cache; if not there, it would check the (external) L2 cache; and if not there, it had to ask the memory controller on the Northbridge to go to RAM, fetch the data and send it to the CPU. It sounds so barbarically crude now it's almost funny. But back then, in the stone age, process nodes were so huge there just wasn't room to put everything on the CPU. And if you wanted integrated graphics so you didn't have to buy a graphics card, did you buy a CPU with a GPU built in? Of course not, that was also the job of the Northbridge! LOL! There are a LOT of things I can say I miss about "the good old days", but that stuff is NOT part of it! ;-)
@robwideman2534 3 years ago
That was called a COAST module. Cache On A Stick!
@Variarte_ 3 years ago
This kind of feels like an alternative and an extension to AMD's paper 'Analyzing and Leveraging Shared L1 Caches in GPUs', except for CPUs.
@rattlehead999 3 years ago
This also reminds me of the Core 2 Duo, which had a big L2 cache for its time.
@RobertVolmer 3 years ago
Fun fact: the K6-III and K6-III+ at 1:41 did have 256 KB of L2 cache on die.
@chinesepopsongs00 3 years ago
This sounds like a more practical way of using space. You lose some cycles on the L2, but the gain is that the L3 is faster, and you save space, so a bigger L1 could be done also. A bigger L1 and a faster virtual L3 might cancel out the latency loss of the slightly slower L2. We will see if Intel and AMD start to adopt this new idea.
@RobBCactive 3 years ago
The video doesn't point it out, but the different cache levels use virtual or physical addresses. Code uses its virtual addresses to run at full tilt, but to share cached memory between processes, a virtual-to-physical address translation using the process memory mapping is required. This is the reason L1 cache can be very low latency, while L2 & L3 are slower. Arguably IBM appears to be using a distributed L3 cache, judging from the size. If you're writing high-performance code, you really want the main working set to fit in the core's L2, accessing L3 and RAM in predictable ways so the processor can anticipate and pre-fetch.
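One concrete way to see the virtual/physical split: L1 caches are commonly virtually indexed, physically tagged (VIPT), with the set-index bits chosen to fall entirely inside the page offset, so the index is identical in the virtual and physical address and set selection can start in parallel with the TLB lookup. The parameters below (4 KB pages, 64 B lines, 64 sets) are illustrative assumptions, not any particular chip's geometry.

```python
PAGE_BITS, LINE_BITS, SET_BITS = 12, 6, 6   # 4 KB pages, 64 B lines, 64 sets
assert LINE_BITS + SET_BITS <= PAGE_BITS    # index bits fit inside the page offset

def l1_set(addr: int) -> int:
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)

virtual  = 0x0000_7654_3ABC
physical = 0x0000_1111_1ABC                 # translation changed everything but the page offset
assert l1_set(virtual) == l1_set(physical)  # same set either way: no need to wait for the TLB
print(f"both map to set {l1_set(virtual)}")
```

Larger L2/L3 caches need more index bits than the page offset provides, so they must wait on translation, which is one reason they sit at higher latency.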
@AtariKafa 3 years ago
I am pretty sure IBM could save the GPU market, but they have more important things to do; they always solve very, very big problems.
@FutureChaosTV 3 years ago
IBM designs systems while AMD and Intel design products. That's how I would classify the difference.
@recai4326 3 years ago
Bro, what are you doing here? :D
@henryD9363 3 years ago
Apple's M1 seems to be efficiently addressing that arena. And the M1 is just the first version.
@robertutter1763 3 years ago
It's pretty awesome that you used an AMD K6-III+ in your video. It's the first CPU to use level 3 cache. Of course, it was off-chip level 3 cache, but it was huge compared to what was available in level 2 cache at the time.
@_DarkEmperor 3 years ago
Think of it a different way: if AMD can put a large L3 V-Cache on top of the processor die, it can redesign all on-die L3 cache as L2. You could have a huge L2 cache and a huge L3 cache.
@billkillernic 3 years ago
Just a technicality correction that didn't sit well with me (1:53): they don't start small because they are really fast, they start small because they are really, really close to the CPU cores and the circuitry needs to be really, really short in order to be that fast, which limits the available space that fits the two aforementioned criteria.
@justdoityourself7134 3 years ago
Per Jim Keller, ~"performance can be reduced to branch prediction and data locality" (and I paraphrase). The best, most concise rule of thumb for all hardware and all software design, IMO.
@hammerheadcorvette4 3 years ago
Memory is still the limiting factor in core performance. That has brought on this type of innovation and 3D stacking.
@justdoityourself7134 3 years ago
@@hammerheadcorvette4 That is what data locality means: how close the data is. E.g. cache, memory, disk, network...
@Game__Boy 3 years ago
Love the choice of a K6-III+ 400 MHz ATZ in your graphic at 1:24 -- I'm using one of those in my Super Socket 7 vintage gaming PC. What a neat processor.
@Veptis 3 years ago
Love the educational part. Really excited for the upcoming technology. But IBM processors really don't end up in consumer products, right? However, the technology might.
@henryD9363 3 years ago
The Apple M1 does not have an L3 cache. The CPU, GPU and all the RAM are on the same package, close together. No connectors, pins, sockets or copper traces on a motherboard. Very similar.
@jecelassumpcaojr890 3 years ago
The Tilera chips already used this idea - the private L1 caches could also be used as part of a chip-wide L2 cache. I myself have used this in some of my projects since 2001. Some of the ideas that make this work can be found in en.wikipedia.org/wiki/Directory-based_coherence
@VRforAll 3 years ago
That's 20 years ago, meaning key patents may have expired. If so, other companies could start working on their own implementations.
@Girvo747 3 years ago
"Possible" that IBM has patents out the wazoo? Hahaha, it's guaranteed that they do. When I partnered with them on a homomorphic encryption system I built, they tried to convince me to patent it, despite it being a pretty simple implementation of the Paillier algorithm lol
@robinbinder8658 3 years ago
casually flexing he partnered with IBM like bruh
@S41t4r4 3 years ago
They also flex about how many patents they applied for in a year and how many they have in total in presentations.
@robinbinder8658 3 years ago
@@S41t4r4 my achievements this year: - once ate a whole family-size pizza alone. Get on my level.
@S41t4r4 3 years ago
@Mahesh If you sold your product before the patent was admitted, the other party either can't patent it, or loses the patent pretty easily.
@Girvo747 3 years ago
@Mahesh Depends if the jurisdiction is first-to-file or first-to-invent.
@jeffwells641 3 years ago
A really nice analogy for caches is the Librarian. He keeps a handful of the most recently requested books on top of his desk, a stack of recently used books underneath his desk, a rack behind his desk with the most popular books of the last few days, and finally the main library where the books are stored more permanently. As the Librarian gets a book request, he puts the book on his desk and moves the oldest book under his desk, then moves the oldest book there back to the shelf behind him, and then puts the oldest book on the shelf back into its place in the library. Here it is easy to see that if the librarian keeps too many books on his desk, he's going to have difficulty getting his work done - the books will be in the way. If he's got too many books under his desk it will slow him down as well, but he's got more room to arrange this area. The shelf behind him has even more flexibility in its size and arrangement. That's a CPU and its L1, L2, and L3 caches, plus its main memory.
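The librarian's rule translates almost line-for-line into code. Below is a minimal simulation of that demotion chain with invented level sizes: each request lands on the desk, and the oldest item at any over-full level is bumped one level down.

```python
levels = {"desk": [], "under_desk": [], "shelf": []}   # the "library" (RAM) is unbounded
sizes  = {"desk": 3, "under_desk": 6, "shelf": 12}     # invented capacities

def request(book):
    for books in levels.values():
        if book in books:
            books.remove(book)                 # found at some level: pull it out
            break
    levels["desk"].insert(0, book)             # the freshest book goes on top of the desk
    names = list(levels)
    for i, name in enumerate(names):
        while len(levels[name]) > sizes[name]:
            evicted = levels[name].pop()       # the oldest book at an over-full level...
            if i + 1 < len(names):
                levels[names[i + 1]].insert(0, evicted)   # ...moves one level down
            # else: it simply goes back to the library (RAM)

for book in ["A", "B", "C", "D", "A", "E"]:
    request(book)
print(levels)   # {'desk': ['E', 'A', 'D'], 'under_desk': ['C', 'B'], 'shelf': []}
```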
@Bengt.Lueers 3 years ago
This is truly an amazing technology, and Ian's enthusiasm for it is palpable.
@erichrybinczak8894 3 years ago
I wonder, with newer packaging technologies, whether we are going to go back to Pentium Pro or Pentium II-style designs with off-die cache. With newer interposers, a larger L3 or L4 can be on-package, using other nm processes that can make cache faster or more power efficient.
@CjqNslXUcM 3 years ago
AMD's 3D V-Cache is apparently twice as dense as the on-die L3, because it uses a library optimized for cache, on the same 7 nm process.
@fvckyoutubescensorshipandt2718 3 years ago
They really need to hurry up and make dies cube-shaped with microholes in them for liquid cooling (millions or billions of layers instead of just 17). Cube vs square = both more cache and lower latency.
@notthedroidsyourelookingfo4026 2 years ago
2:11 I don't think that statement is correct. Isn't branch prediction more about pipeline utilisation than memory access? As in: the instructions for *after* a predicted jump get decoded *before* the jump condition is evaluated. Cache utilisation is more about data locality, as far as I understand it.
@almostmatt1tas 3 years ago
This really came at the right time; I've been learning about cache lately and I didn't realise how much I didn't know :) I really thought it was as simple as more cache = more better.
@lahmyaj 3 years ago
Not gonna pretend to know chip design too much, but from the stuff I watch, Apple's M1 is really good because it keeps its cores fed due to its design approach (the whole 8x instruction decoders, large caches etc.). Is this IBM approach not similar, in that it attempts to look at the problem differently than others and looks to utilise its resources better/more efficiently rather than just adding more resources?
@henryD9363 3 years ago
And... the M1 also does not have an L3 cache. Seems to work extremely well.
@DaxVJacobson 3 years ago
Makes me think of Intel's use of register renaming, a technique that abstracts logical registers from physical registers. Plus Intel loves cache on their chips; this already sounds like it's an Intel technology.
@RobBCactive 3 years ago
Everyone loves cache, because memory access is relatively much slower compared to CPU cycles. Intel has reduced cache sizes (P4) and loves market segmentation, with cache as a differentiator.
@WhenDoesTheVideoActuallyStart 3 years ago
I regret to inform you that no, caches are in fact not "an Intel technology"; they're quite... widely used, to say the least.
@DaxVJacobson 3 years ago
@@WhenDoesTheVideoActuallyStart There used to be a web site (behind a paywall, I forget its name) that would talk about problems with Intel's CPU manufacturing process (before the +++++ stuff), and that site would always say Intel always wanted to be a memory company, so their half-assed processors always had a ton of cache to paper over what a crap processor they made. This being a technology to add a CRAP ton of cache to a CPU thus sounds like an Intel technology to me: papering over the fact that Intel is a crap CPU company that is a monopoly, and the only things keeping it in business are monopoly power and cache memory to make it look like an innovator.
@WhenDoesTheVideoActuallyStart 3 years ago
@@DaxVJacobson Every modern processor has shittons of on-die cache, because they're all wide 5 GHz data-crunching machines, while the basic DRAM cell hasn't changed much for decades (besides the shrinking) and still has the same old access latency, which is monstrous when compared to modern on-die SRAM. Every year that passes, DRAM is more and more of a restriction on the performance of CPUs, and pre-fetching as much data as possible into faster SRAM caches is one of the only solutions, besides magically eliminating Amdahl's law or having STT-MRAM catch up with the 5nm node.
@hammerheadcorvette4 3 years ago
@@DaxVJacobson Companies have been trying to add a CRAP ton of cache to a core since the ARM days of the 80's. Intel sold many Haswell processors with 256MB of cache to the financial market.
@joxer 3 years ago
OK, as someone who worked on IBM database systems (Db2 for Linux and Windows for nearly 30 years, a couple more on SQL/DS): a mainframe isn't the only way that your banking transactions can be kept safe. This is software, specifically the write-ahead logging of your transactions, and then adding things like dual logging, log shipping for high-availability disaster recovery, failover for nodes, etc. We would commonly have 32 to 256 database partitions on 4 to 64 physical machines presenting to the end user as a single database, for instance. On with the show.
@rattlehead999 3 years ago
As per usual in the last 12 years, it's IBM doing the major innovations.
@katietree4949 3 years ago
Diggin' that Konami Code, brother. Also, I came here from a Gamers Nexus video, and I gotta say you've earned a new subscriber.
@Psychx_ 3 years ago
And here I was, thinking that future AMD designs might incorporate an L4 cache on the IO die. Now I am not so sure about that anymore, as there seem to be more options.
@TheStin777 3 years ago
That would seem like the logical next step after 3D stacking (as far as I understand it), but making a 'virtual' cache pool will be a LOT faster than L4, especially if it is done like this with L2: increasing size but drastically reducing latency.
@Psychx_ 3 years ago
@@TheStin777 It's all workload dependent though. The IBM Z series chips are tailored towards financial computing.
@IttyBittyVox 3 years ago
Virtual cache sounds interesting, but I think I'd need to see examples of how cache eviction works in practice for some realistic loading scenarios before I get excited, particularly for the workloads we think it'll help in. For example, does this design cope well with games that would benefit more from a larger dedicated shared cache? Or does the design mean that the virtual L3 and L4 get squeezed out to make room for L2 when that wouldn't be optimal for the load in question?
@Bengt.Lueers 3 years ago
Comment for the algorithm
@CaseAgainstFaith1 3 years ago
I had no idea that "mainframes" still exist. I thought modern banking systems simply used stacks and stacks of standard server chips. I had no idea there was a whole different processor design for modern mainframes.
@evenblackercrow4476 2 years ago
Thanks. Great explanation of cache and its relevance. I like your term 'workload profile' and its application to the flexibility of virtual cache.
@MichaelPohoreski 3 years ago
Whoa! This is the same Dr. Ian that got AMD to price the Threadripper 3990X at $3990. Subbed!
@robertalker652 3 years ago
You did a marvelous job of explaining and simplifying the subject matter, and I think many a layperson who has even a minimal grasp of system/processor architecture would not fail to grasp what you've presented.
@n.butyllithium5463 3 years ago
I like the toolbelt analogy for cache. The CPU registers are the tools in your hands; you work directly with them. L1 is your toolbelt: bigger, but slower. L2 is your toolbox: you need to restock your toolbelt from there, but it can hold even more tools. L3 is your truck: lots of storage, but you need to go all the way back to the truck. DRAM is the hardware store: almost all the tools available, but very far away; you lose hours just going over there and back. The hard drive is China: absolutely all tools, but it takes months to get the one you ordered.
@shavais33 3 years ago
Super cool idea on IBM's part. So good to see there are still active and commercially viable processor designers out there besides Intel and AMD. When I went through my Computer Engineering undergraduate work, I learned about the layering process used when manufacturing silicon chips. Expose the surface to a gas, let it dry and form a masking layer; shine a light through a film mask and through a lens onto the surface to burn a pattern through the masking layer; expose that to a doped gas to add doping to the exposed silicon; use an acid to remove the mask layer. Rinse and repeat with a different mask and maybe a different doping for a different purpose. At the time, they were doing 7 to 13 layers. I think often there is a grounding layer on the bottom and a power layer on top, with holes poked through the grounding layer for signal leads. I think adding more layers had diminishing utility because of crosstalk (flowing current induces a magnetic field around it, which can, in turn, induce current flow in nearby conductors, thereby interfering with any signals going through those nearby conductors) and heat dispersal (current flowing through a conductor that has some resistance produces heat and consumes power). It seems like both of those issues are a matter of materials research. If you can get very high resistance in a material achievable through a gas exposure process that conducts heat very well, you can disperse heat and prevent crosstalk at the same time. And if you can get very low resistance, you can get more current through with less voltage and not only generate less heat and save power, but also charge up capacitance faster and thereby increase signal speeds. If you could get a semiconducting material that was superconducting when/where it is in its conducting state and has very high resistance (enough to basically completely block current flow) in its non-conducting state, that would really win the game, because your power requirements would be incredibly low, your signal speeds incredibly high, and there would be no heat to disperse. I guess that would be the holy grail for this stuff. Anyway, the point is, it seems like a lot could be achieved by materials research to reduce heat and crosstalk, and then by adding a lot of layers. Imagine having a cube instead of a flat die. Right now dies are 100 mm² and about 0.5 mm thick. Some are 0.275 mm thick. If you had a die that was 100 mm cubed... that could potentially, theoretically, multiply the transistor count by up to 200 or so. Of course it would also take many times longer to produce a wafer, so I guess that's an enormous cost problem. Which I guess probably means that nobody is working on anything like that. But when our processor chips are super-semi-conducting cubes (or honeycomb shapes?) instead of flat chips, that's when AI robots will speak fluently, be indistinguishable from humans over audio-only communication channels, and be capable of taking over the work of world-class human doctors, scientists, artists and engineers and doing a better job of those things.
@AlexanderKalish 3 years ago
In my understanding, it would only work if records are evicted from the caches not only by being replaced with newer data but also by some sort of time-to-live eviction (and possibly other methods), because if you only have eviction by replacement with newer records, then all the L2 caches will be full after a short period of time and there will be no space in any of them to be allocated as virtual L3 cache. Also, this time-to-live and other kinds of eviction should not evict the line to some other core's cache but rather erase it, because otherwise you'll just have all cores ping-ponging the same old data.
@capability-snob 3 years ago
Lines are in a queue to be written back to RAM before they leave L1. You can safely discard anything in cache on most designs without impacting correctness. The hard part tends to be picking which lines to discard 🙃
@AlexanderKalish 3 years ago
@@capability-snob Well, the first thing that comes to mind is the time since the last use of the line.
@nghiaminh7704 2 years ago
At 13:08: the Intel Xeon is ~3 GHz; with 12-cycle latency, that's 12/3 = 4 ns. The z16 is 19/5.2 ≈ 3.65 ns. Then why is Intel's cache considered faster?
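Working that arithmetic through (latency in nanoseconds = cycles ÷ clock in GHz) supports the comment's suspicion: in wall-clock terms the z16 figure comes out slightly better, and "faster" claims usually come from comparing raw cycle counts rather than nanoseconds.

```python
# latency_ns = cycles / frequency_GHz, using the figures quoted in the comment above
xeon_l2 = 12 / 3.0    # ~4.00 ns
z16_l2  = 19 / 5.2    # ~3.65 ns
print(f"Xeon: {xeon_l2:.2f} ns, z16/Telum: {z16_l2:.2f} ns")
```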
@JoneKone 3 years ago
cache wars are in full swing.
@Friend_of_the_One-Eyed_Ladies 3 years ago
You explain things really well. Thumbs up.
@D.u.d.e.r 3 years ago
Cache is ultra important for future CPUs, and your highlighting of IBM eliminating L3 echoes what Apple did a long time ago with their Axx ARM-based CPUs, where they completely dumped L3 and enlarged L2. Having a virtual L3 cache made out of L2 cache is definitely the way to go, but we are still not where we should be with system memory, which should run as fast in clocks as the CPU itself. Future computers and CPUs need memory as fast as the CPU itself to achieve the greatest possible efficiency, and ideally that memory would also be non-volatile. Having RAM as fast as the fastest cache, with capacity and non-volatility, sounds like sci-fi, but it's what we need for the future. Thank you for highlighting this new design from IBM; it's really something special and proves again that IBM is a superb CPU design company!
@ragincaveman 3 years ago
Tightly coupled local data stores are one way of reducing memory pressure. They can be private, or coherent shared public, depending on the workload and application.
3 years ago
My take - virtual L3 and L4 are there just for compatibility reasons; most things will only use L1 and L2, or will have to go to main memory anyway.
@paulgray1318 3 years ago
IBM loves to sell chips with, say, 8 cores even if you get a 4-core system, as you can enable those other cores at a later time after paying IBM more money. Now the question is - will those unlicensed inactive cores also have their L2/virtual L3 cache disabled, or will the way IBM locks cores off still leave that L2 cache exposed and enabled for the virtual L3 cache? What I find interesting about IBM's design is the whole memory controller approach, which was in effect reprogrammable for future memory interfaces; that's pretty epic.
@selohcin 3 years ago
Incredible. We need this on consumer platforms RIGHT NOW.
@dasberserkr 3 years ago
This concept reminds me of a paper I wrote on virtual scratchpads back in 2012... At the time there was also the concept of virtual caches, which allows you to borrow space from on-chip memory across other cores. Very interesting to see IBM use something similar here... BTW, I like how the timelines show up as you speak; I will need to learn to do that for my videos :)
@ear4funk814 2 years ago
You may need to check with a patent lawyer ... this is obviously a "big bucks" feature, and if you had mentioned "virtual caches" back in 2012, then the idea is in the public domain such that others can freely use it ... or maybe you are entitled to royalties as well (I'm sure IBM has patents issued on this).
@petergerdes1094 a year ago
@@ear4funk814 Why would they? They obviously don't have the patent, and the only question is whether AMD or Apple etc. gets sued for patent infringement, at which point they'll do a paper search for prior art.
@YouHaventSeenMeRight 3 years ago
I think you forgot about registers in your cache level story. Most modern GPUs have a large register section for immediate access to data. L1 cache typically has 3-5 clock cycles of delay; registers are immediately available. Typically the register section of a GPU is much larger than that of a CPU, so registers come into play more for GPUs than they do for CPUs.
@Mammothtruk 3 years ago
Moving a line from cache to virtual L3 means throwing it from one memory controller to another, and a cycle is taken to do so, which means two memory controllers would have to wait a cycle to do anything else. I can see the benefit, but also possible issues: crosstalk might stall other things until it is done, and there could be attacks that force a leak of memory from one core to another so it can then be read? It will be interesting to see where they all go with all the new cache things coming around.
@jpdj2715 3 years ago
You omit one thing: the type of memory cells used for these caches. In 1995, IBM had a UNIX (AIX) server that used static RAM for cache, and it could be expanded. In line with your point, throwing more cache into a base model made it faster than the next model up with a more powerful CPU. The nature of DRAM is that it is unstable and slow. That static RAM had less than 1/8th the latency of DRAM. And it was very expensive - probably because of very low production volumes. I always wondered why the DRAM in my PC had not been replaced by static RAM in the past decades. Oh, mainframes - next to billions of smartphones and PCs, and a couple of Macs and iPhones, there were about 10,000 IBM mainframes in use in 2019.
@nedegt1877 3 years ago
I think 3D cache or stacked cache is going to be the future, for a couple of reasons. 1. Space. Like you said in the video, stacked cache allows bigger and faster cache; 4 GB is doable. 2. Price. The logic behind virtual cache systems is more expensive than 3D cache. 3. Applications. There are tons of ways you can create very fast chips using 3D design. Imagine a 3D (stacked) chip that's 3 layers: the 1st layer for CPUs, the 2nd for massive cache, the 3rd for GPUs. That's a very fast and crazy chip, having both CPU and GPU using the L3 cache in the middle. That's some pretty awesome power you get in one chip package for the majority of workloads. But the best idea would be a hybrid 3D virtual cache. Eventually you'll see CPUs getting GBs of very fast memory, be it L2, 3 or 4.
@unvergebeneid 3 years ago
Why does "searching through" a larger memory take more time? What's the mechanism there? Is the delay in the LUT or in the actual memory addressing?
@BRNKoINSANITY 3 years ago
Picture it like an index. It is hunting through the terms/references, and a larger book will have a longer index. It's got to find the location of the data in that index before it can retrieve it.
@capability-snob 3 years ago
The signal takes time to travel through a physically larger cache, and then to travel back. We're at the sort of clock speeds now where, even at the speed of light, an unimpeded 5 GHz signal could only cover 6 cm in a cycle. Add to that, driving buses with many outputs gets inefficient quickly and drives down the slew rate on the signal, so there are a few levels of muxing between the query and the actual lines. I bet finding the balance with each new process is quite the challenge.
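That 6 cm figure checks out as simple arithmetic, and it's an upper bound, since on-die signals propagate at a fraction of c and pass through mux and driver stages along the way:

```python
c = 3.0e8                                # speed of light in m/s; on-die signals are slower
cycle_time = 1 / 5.0e9                   # one cycle at 5 GHz
print(f"{c * cycle_time * 100:.0f} cm")  # -> 6 cm per cycle, at best
```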
@redtails 3 years ago
I'm not getting a clear answer on whether the shared L2 is faster compared to L3. If the shared L2 is on another chip, it might well be slower than L3 is, or maybe even comparable to RAM. It would be of interest to just have a table of scenario versus latency.
@martylawson1638 3 years ago
AFAIK cache uses a fixed mapping between main memory addresses and cache addresses to maximize search speed. This means that useful data is never packed densely, and there are lots of gaps of vacant cache memory all the time. Sounds like an excellent resource to turn into a virtual L3/L4.
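The fixed mapping is easy to demonstrate: in a direct-mapped or set-associative cache, the set is just a fixed slice of the address bits, so unluckily strided data piles into a few sets while the rest sit vacant. The parameters below (64 B lines, 1024 sets, a 256 KB stride) are invented for illustration.

```python
LINE_BITS, NUM_SETS = 6, 1024          # 64 B lines, 1024 sets (invented)

def cache_set(addr: int) -> int:
    return (addr >> LINE_BITS) & (NUM_SETS - 1)   # the set is a fixed slice of address bits

# Walking memory with a 256 KB stride (e.g. one column of a big matrix):
addrs = [0x1000_0000 + i * 256 * 1024 for i in range(64)]
used = {cache_set(a) for a in addrs}
print(f"{len(used)} distinct set(s) used out of {NUM_SETS}")   # -> 1: everything collides
```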
@neutechevo 3 years ago
There seems to be a right time for each approach and design to work. IBM seems (always) to be right at the tip of the spear that is innovation. At earlier times, such large pools without some highly intelligent mechanism couldn't make sense. If I am not mistaken, IBM also had a connection with GlobalFoundries and the first Zen iterations; the whole CCX/core/cache structure, I think, was influenced by the trends of the µ-arch at the time, and each time IBM seems to be there. Also, it seems that Intel is starting to make larger L2 structures with the new cores... maybe going towards this approach? I hope your channel keeps growing; it gives me good vibes and has a feline feeling to it.
@tomcarlson3913 3 years ago
One thing to think about is that the virtual caches will never be as big in practice as they are in theory. L2 in use subtracts from available L3, and both L2 and L3 subtract from L4. On multi-core loads with heavy L2 usage, L3 and L4 could effectively become tiny. Depending on the cache management software, the model where cache size grows with cache level could be turned on its head and cause problems. The cache management software's design and performance will be a key player in making or breaking this technology in the field.
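Putting rough numbers on that shrinkage (core count and L2 size Telum-like, the occupancy figures invented):

```python
L2_MB, CORES = 32, 8                     # Telum-like: eight cores, 32 MB of L2 each
idle_ish = [0.9, 0.8, 0.2, 0.1, 0.1, 0.0, 0.0, 0.0]  # invented per-core L2 occupancy
busy     = [0.95] * CORES                             # every core hammering its own L2

def virtual_l3_mb(occupancy):
    return sum((1 - o) * L2_MB for o in occupancy)    # virtual L3 = everyone's unused L2

print(f"lightly loaded: {virtual_l3_mb(idle_ish):.0f} MB of virtual L3")
print(f"fully loaded:   {virtual_l3_mb(busy):.0f} MB of virtual L3")
```

Under these assumptions the virtual L3 swings from roughly 189 MB down to about 13 MB, which is exactly the "tiny L3 under heavy multi-core load" scenario the comment warns about.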
@justdoityourself7134 3 years ago
This is super context sensitive. For example, on normal desktop workloads, more L2 would be great... but mainly if there were fewer cores, because there would be more task swaps. But with higher core counts, the scheduler remembers process or thread affinity for this very reason: to keep the specific core's cache hot for that task. This is so complicated. I can't really imagine it being a guaranteed improvement. Very interesting; lots of power vs performance vs workload vs software optimization type considerations.
@blazbohinc4964 3 years ago
Hey Ian, what do you think about PrimoCache?
@uncurlhalo 3 years ago
I work in a DC for a bank, and they just installed a load of the newest model of Z mainframes. These things are sick, and some of them even have liquid cooling built in. Amazing machines.
@123argonaut 3 years ago
"Virtual cache"... I'll be darned. I knew nothing about computers, but now I know about "virtual cache"... cutting edge stuff.
@SalvatorePellitteri 3 years ago
Do you think HBM could be used as a massive, tens-of-GB L4?
@law-abiding-criminal 3 years ago
Are the L1, L2 and L3 caches searched simultaneously or one after the other? The one solution is faster while the other is more energy saving.
@henryD9363 3 years ago
Good question. I wish I knew the answer. I assume that L2 is searched only if there's a miss on L1; similarly, L3 is searched only on an L2 miss. In a different YouTube video someone said that IBM systems don't care at all about energy. So I think that would hold here.
@kazedcat 3 years ago
The search is not very time intensive. The majority of the cache latency is caused by wire impedance, so the amount of time you save by simultaneous search is very small, and you are stuck waiting for the signal to traverse the wires.
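A toy expected-value model shows the shape of the trade-off described in the two replies above; every latency, energy figure and hit probability in it is invented for illustration:

```python
levels = [("L1", 4, 1.0, 0.90),   # (name, latency in cycles, energy units, P(hit here))
          ("L2", 14, 5.0, 0.08),
          ("L3", 40, 20.0, 0.02)]

serial_lat = serial_en = par_lat = 0.0
cum_lat = cum_en = 0.0
for name, lat, en, p in levels:
    cum_lat += lat                # serial: level i is probed only after i-1 missed
    cum_en += en                  # serial: energy is spent only on probed levels
    serial_lat += p * cum_lat
    serial_en += p * cum_en
    par_lat += p * lat            # parallel: the hitting level answers in its own latency
par_en = sum(en for _, _, en, _ in levels)  # parallel: every level burns energy, every time

print(f"serial:   {serial_lat:.2f} cycles, {serial_en:.2f} energy")
print(f"parallel: {par_lat:.2f} cycles, {par_en:.2f} energy")
```

With these numbers the parallel probe saves under a cycle of average latency while spending more than ten times the energy, which is why probing levels one after the other is the common choice.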
@emscape2 3 years ago
I knew someone would come up with this at some point. Props to them.
@D.u.d.e.r 2 years ago
This is an excellent report and overall analysis, Ian, well done!👍 IBM is really creating something special with Telum, and its quite different approach, with the virtual, shared L2 cache, reminds me of how Apple ditched the slower L3 cache and expanded the L2 cache in their newer Axx chips. Still, IBM is coming up with something special again, which might inspire both team blue and team red, as well as Arm. Regarding cache itself, it would be awesome one day to have GBs of high-speed cache on the SOC package, which would completely replace DDR and even HBM. It's quite clear that the future of memory is to get it running much faster, so that RAM runs almost as fast as cache and non-volatile memory like flash runs as fast as RAM. The ideal would be one single, unified memory which runs as fast as the fastest cache but is non-volatile like flash 😁
@joechang8696 3 years ago
In the multi-core era, Intel processor caches were inclusive, so whatever is in L2 is also in L3. On an L2 miss, it is only necessary to look in the shared L3, without worrying about what's in other cores' (private) L2s. With Skylake (SP), L2/L3 became exclusive (AMD did this earlier?). On an initial load, memory goes straight to L2; on eviction from L2, it goes to L3. To implement this, each core's L2 tags are visible on the ring. This being the case, L3 is no longer strictly necessary - however, it may be desirable to keep L2 access times reasonably short.
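The inclusive/exclusive distinction comes down to two different fill-and-evict flows. Here it is as a minimal sketch, with Python sets standing in for the caches and no real sizing or associativity modelled:

```python
l2, l3 = set(), set()   # sets stand in for the caches; only the fill/evict flow is modelled

def fill_inclusive(line):
    l2.add(line); l3.add(line)         # inclusive: whatever is in L2 is also in L3

def evict_l2_inclusive(line):
    l2.discard(line)                   # the L3 copy stays behind

def fill_exclusive(line):
    l2.add(line)                       # exclusive: memory goes straight to L2, skipping L3

def evict_l2_exclusive(line):
    l2.discard(line); l3.add(line)     # L3 only ever holds L2's castoffs (a victim cache)

fill_exclusive("A"); evict_l2_exclusive("A")
print("A" in l3, "A" in l2)            # True False: capacities add up, nothing is duplicated
```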
@Sunny-gt8zi 3 years ago
Thank you so much for going through these more deliciously geeky topics that others are not informing us about on YouTube! I really appreciate it :D
@woswasdenni1914 3 years ago
Where are the FPS comparison charts for that mainframe? Asking for a friend.
@Angmar3 3 years ago
Nice video, a great explanation of how cache works.
@deez6005 3 years ago
Love your in-depth content. Love the channel. Keep up the quality work.
@joe7272 3 years ago
That's a very good strategy for "big data". Memory prediction accuracy is likely much higher too, meaning it takes full advantage of the size.
@congchuatocmay4837 a year ago
Caching is very important if you want to do large fast Walsh-Hadamard transforms, which you might for SwitchNet.
@KaosArbitrium 3 years ago
This has a lot of potential. I'd be a bit worried about the tracking accuracy and the security, but so long as they've kept such things in mind, this should work out amazingly.