Thank you so much for your videos! I’m currently a machine learning engineer trying to cover the computer science theory I never learnt at school. These videos are a goldmine!
@BitLemonSoftware · 3 days ago
I'm glad it helps you ☺️
@KrisRyanStallard · 3 days ago
Excellent video. Informative without getting bogged down in too many unnecessary details
@BitLemonSoftware · 3 days ago
Thanks! It's really nice to hear.
@harshnj · 2 days ago
Subscribed! Just don't stop making these quality videos.
@BitLemonSoftware · 2 days ago
I won't ☺️
@szymonozog7862 · 3 days ago
Love the series so far, keep it going!
@mateuszpragnacy8327 · 3 days ago
Really good videos. They're really helping me design the cache for my Minecraft CPU ❤
@BitLemonSoftware · 3 days ago
Thanks! Good luck
@stefanopilone957 · 4 days ago
Thanks, very clear. I liked & subscribed.
@BitLemonSoftware · 3 days ago
Thank you for the support!
@dj.yacine · 4 days ago
Thanks 👍. High quality 💯
@abunapha · 2 days ago
Thank you
@der.Schtefan · 3 days ago
You did not explain associativity.
@BitLemonSoftware · 3 days ago
I explained it in the first video of the playlist.
@chetan_naik · 3 days ago
Informative video, but why stop at L3 or L4 cache? Why not add L5, L6, L7, and so on to improve performance?
@turner7777 · 3 days ago
Probably diminishing returns with increased cost and complexity.
@der.Schtefan · 3 days ago
It's down to how memory is implemented. By the time you are past L3, the latency starts being almost as "slow" as main system RAM on a memory bus. L1 is usually very expensive, space-eating SRAM.
@jedijackattack3594 · 3 days ago
This has been done before: Intel put an L4 cache on certain Broadwell C-series chips using a big external eDRAM die. The first problem is that cache is rather expensive. Die cost scales faster than linearly with die size (roughly quadratically once yield losses are factored in), so a 100mm² die can cost about 4x as much as a 50mm² one. And modern CPU cores are actually quite small: a Zen 5 core is only around 4mm², yet on the full Zen 5 CCD about half the die area is the 32 MiB of L3 cache plus the 8 MiB of L2.
As an additional problem, thanks to high clock speeds, latency increases along with cache size just from moving data between the cache and the processor. Cache is also quite power hungry even when idle, so designers tend to minimise it on consumer platforms, especially devices that idle a lot like phones, or they allow the whole cache to be powered down, as Intel did on its performance cores.
As for why we still don't see L4 or L5 caches: there are a lot of so-called cache-unfriendly workloads. These workloads tend to have a few things in common: low levels of exploitable instruction-level parallelism, high levels of random branches (especially consecutive branches), and a large, sparsely and randomly accessed dataset. These properties leave the processor unable to effectively speculate ahead or reorder instructions to hide latency, so it is purely at the mercy of the memory subsystem to determine how long each stall will be. Thanks to the random sparse accesses, the caches are unlikely to contain the right data, since lines are being thrashed and discarded constantly, and the prefetchers can't predict the random pattern. A bigger cache may let you brute-force this problem, as AMD has proven with the X3D line of chips, but if the data set is still so much larger than the cache that the hit rate doesn't improve, you will see no improvement in performance.
And if your new big cache doesn't have enough bandwidth to feed the improved hit rate, you will also have a problem where the cores end up stalling while waiting for the cache to get around to servicing their requests; making a cache higher bandwidth makes it bigger and more expensive. This is part of the reason a lot of modern CPU optimisation, and a lot of HPC software optimisation, focuses on making the data and instruction streams as predictable and cache-friendly as possible.
@chetan_naik · 2 days ago
@jedijackattack3594 Well explained. I also wonder: when a cache miss occurs, is the latency just the RAM latency, or the RAM latency plus the latencies of all the cache levels combined?
@jyotiradityasatpathy3546 · 2 days ago
@chetan_naik It depends on the access architecture, which is usually serial rather than parallel, meaning the latencies are added. However, a main memory access takes far, far longer than a register file access.
@stachowi · 2 days ago
Very good (and to the point).
@BitLemonSoftware · 2 days ago
Thanks
@abskrnjn · 2 days ago
How did you make this video? Cool visuals.
@BitLemonSoftware · 2 days ago
Thanks. Photoshop and CapCut
@hatsuneadc · An hour ago
What happens if data is not yet in L3 when another core tries to access it? Does the core wait for it to propagate, or does it read the last known (old) state?
@anonymoususerinterface · 2 days ago
Can I ask where you get this knowledge from? I would like to know more!
@BitLemonSoftware · 2 days ago
My own knowledge as a software/firmware engineer, plus the research I do for each video. You can see the sources I used in the description.