Thank you so much for your videos! I’m currently a machine learning engineer trying to cover the computer science theory I never learnt at school. These videos are a goldmine!
@BitLemonSoftware · 3 days ago
I'm glad it helps you ☺️
@KrisRyanStallard · 3 days ago
Excellent video. Informative without getting bogged down in too many unnecessary details
@BitLemonSoftware · 3 days ago
Thanks! It's really nice to hear.
@harshnj · 2 days ago
Subscribed! Just don't stop making these quality videos.
@BitLemonSoftware · 2 days ago
I won't ☺️
@szymonozog7862 · 3 days ago
Love the series so far, keep it going!
@mateuszpragnacy8327 · 3 days ago
Really good videos. They're really helping me design the cache for my Minecraft CPU ❤
@BitLemonSoftware · 3 days ago
Thanks! Good luck
@stefanopilone957 · 4 days ago
Thanks, very clear. I liked & subscribed.
@BitLemonSoftware · 3 days ago
Thank you for the support!
@dj.yacine · 4 days ago
Thanks 👍. High quality 💯
@abunapha · 2 days ago
Thank you
@der.Schtefan · 3 days ago
You did not explain associativity.
@BitLemonSoftware · 3 days ago
I explained it in the first video of the playlist.
@chetan_naik · 3 days ago
Informative video, but why stop at L3 or L4 cache? Why not add L5, L6, L7, and so on to improve performance?
@turner7777 · 3 days ago
Probably diminishing returns with increased cost and complexity.
@der.Schtefan · 3 days ago
It's down to how memory is implemented. By the time you are past L3, the latency starts being almost as "slow" as main system RAM on a memory bus. L1 is usually very expensive, space-eating SRAM.
@jedijackattack3594 · 3 days ago
This has been done before: Intel put an L4 cache on certain Broadwell C-series chips using a big external eDRAM die. The first problem is that cache is rather expensive. Die cost scales faster than linearly with die size (roughly quadratically once yield losses are factored in), so a 100mm² die can cost about 4x as much as a 50mm² one. And modern CPU cores are actually quite small: a Zen 5 core is only around 4mm², yet on the full Zen 5 CCD about half the die area is the 32 MiB of L3 cache plus the 8 MiB of L2.
As an additional problem, thanks to high clock speeds, latency increases along with cache size just from moving data between the cache and the processor. Cache is also quite power hungry even when idle, so designers tend to minimise it on consumer platforms, especially devices that idle a lot like phones, or they allow the whole cache to be powered down, as Intel did on its performance cores.
As for why we still don't see L4 or L5 caches: there are a lot of so-called cache-unfriendly workloads. These workloads tend to have a few things in common: low levels of exploitable instruction-level parallelism, high levels of random branches (especially consecutive branches), and a large, sparsely and randomly accessed dataset. These properties leave the processor unable to effectively speculate ahead or reorder instructions to hide latency, so it is purely at the mercy of the memory subsystem to determine how long each stall will be. Thanks to the random sparse accesses, the caches are unlikely to contain the right data, since lines are being thrashed and discarded constantly, and the prefetchers can't predict the random pattern. A bigger cache may let you brute-force this problem, as AMD has proven with the X3D line of chips, but if the data set is still so much larger than the cache that the hit rate doesn't improve, you will see no improvement in performance.
And if your new big cache doesn't have enough bandwidth to feed the improved hit rate, you will also have a problem where the cores end up stalling while waiting for the cache to get around to servicing their requests; making a cache higher bandwidth makes it bigger and more expensive. This is part of the reason a lot of modern CPU optimisation, and a lot of HPC software optimisation, focuses on making the data and instruction streams as predictable and cache-friendly as possible.
@chetan_naik · 2 days ago
@jedijackattack3594 Well explained. I also wonder: when a cache miss occurs, is the latency just the RAM latency, or the RAM latency plus the latencies of all the cache levels combined?
@jyotiradityasatpathy3546 · 2 days ago
@chetan_naik It depends on the access architecture, which is usually serial rather than parallel, meaning the latencies are added. However, a main memory access takes far, far longer than a register file access.
@stachowi · 2 days ago
Very good (and to the point).
@BitLemonSoftware · 2 days ago
Thanks
@abskrnjn · 2 days ago
How did you make this video? Cool visuals.
@BitLemonSoftware · 2 days ago
Thanks. Photoshop and CapCut
@hatsuneadc · An hour ago
What happens if data is not yet in L3 when another core tries to access it? Does the core wait for it to propagate, or does it read the last known (old) state?
@anonymoususerinterface · 2 days ago
Can I ask where you get this knowledge from? I would like to know more!
@BitLemonSoftware · 2 days ago
My own knowledge as a software/firmware engineer, plus the research I do for each video. You can see the sources I used in the description.