Apple M1 Ultra & NUMA - Computerphile

  254,607 views

Computerphile

Apple's latest M1 chip is two older chips bolted together. Dr. Steve Bagley explains how they made it work the same as a single chip.
/ computerphile
/ computer_phile
This video was filmed and edited by Sean Riley.
Computer Science at the University of Nottingham: bit.ly/nottscomputer
Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

Comments: 389
@50PullUps 2 years ago
This entry is pure gold. Please make more vids where the latest tech is a jumping off point for the main topic.
@Stopinvadingmyhardware 2 years ago
Where did I apply for that?
@oskrm 2 years ago
That's the thing, this is not the latest tech.
@nezbrun872 2 years ago
NUMA's not new; it's been a facet of multi-socket Xeon systems for many years, for example, and of other architectures before that. The battle has always been to make the interconnect interfaces (QPI/UPI in Intel speak) as quick as possible to maximise performance. Software such as RDBMSs is NUMA-aware, to optimise workload across sockets (and hence memory domains).
@darkidz24 2 years ago
It could really take this channel to the next level!! Explaining modern day tech
@SilentlyContinue 1 year ago
Yes! Helps with understanding real world application.
@TechTechPotato 2 years ago
Intel's EMIB, similar to UltraFusion, in Sapphire Rapids adds additional latency of 5-8 nanoseconds. This makes the core-to-core latency go from 54 ns worst case to 70 ns worst case. Apple's situation is similar, with similar bandwidth per connection. We expect the latency to be an additional 5-8 nanoseconds also. UltraFusion is using TSMC's InFO_LSI manufacturing.
@Joseph_Roffey 2 years ago
But the difference is one is called “random string of letters” and the other is called “Ultra Fusion” 😍
@eddyecho 2 years ago
@Joseph_Roffey huh? More like one is a "stupid marketing name that really doesn't describe the underlying mechanism" and the other is called "embedded multi-die interconnect bridge"
@landspide 2 years ago
@Joseph_Roffey And begins with "We call this..." and is filled with "... only at Apple can we ..."
@shunyaatma 2 years ago
Any numbers for AMD (Zen 2 and 3) 2-socket systems with and without xGMI cables?
@egor1g 2 years ago
yeah, but it is ARM vs x86, 256 channel memory against 6 and also efficiency cores, also video memory... so not really the same!
@NinjaAdorable 7 months ago
This has been one of the most intuitive and elegant explanations for NUMA I have ever heard!! Kudos
@prla5400 2 years ago
Back to you, Steve
@BenjyP. 2 years ago
I read ml instead of m1, so I thought this would be a video on how the neural cores work. I would love a video on how to use the Apple neural cores for machine learning, as they already take up 20% of the space on the entire chip.
@paulledak291 2 years ago
Nice explanation of how NUMA is implemented. However, you stated that the reason for moving to this architecture is that, as you add more and more cores, you increase the probability of memory collisions. But then you completely forgot to explain how having 2 memory banks reduces the probability of the memory collisions that you would still get as you add more processors. That seems to be the most essential element of this video, and it is completely missing. (Yes, I understand that now there are 2 memory banks with twice the bus bandwidth, but this is never explained. And there are different interleaved memory architectures which could increase the memory bandwidth without resorting to NUMA.)
@bberakable 2 years ago
Agree 100%
@mytech6779 2 years ago
It's not bandwidth at issue; simultaneous access is the issue. This allows the banks to be accessed in parallel. It's like using a network bridge to make two Ethernet subnets. Which I just realized is a really outdated reference, as nobody uses shared-media networks anymore. But basically all computers on a subnet could hear all packets on that subnet, as it was physically one solid wire, and as more nodes were added you would get more chance of collisions and congestion (a non-linear increase). So you chop it in two with a bridge (like a filter of sorts), so only about half of the total traffic can be seen, because only packets addressed to the other subnet are passed through the bridge.
@Sandeep-cz7ls 2 years ago
@mytech6779 wait, I'm still confused: how does this allow the banks to be accessed in parallel? Is it due to the interconnect?
@valshaped 2 years ago
@Sandeep-cz7ls Each bank can be accessed by one CPU at a time. More banks -> more CPUs at a time.
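The "more banks -> more CPUs at a time" point can be put into rough numbers. This is only a toy model, not a description of any real memory controller: it assumes every CPU issues one request per cycle to a bank picked uniformly at random, and that each bank serves exactly one request per cycle. The function name `expected_served` is made up for this sketch.

```python
def expected_served(n_cpus: int, n_banks: int) -> float:
    """Expected number of memory requests served in one cycle.

    Toy-model assumptions: each CPU sends one request per cycle to a
    uniformly random bank, and each bank serves one request per cycle.
    A bank stays idle only if every CPU happened to pick another bank.
    """
    p_bank_idle = (1 - 1 / n_banks) ** n_cpus
    return n_banks * (1 - p_bank_idle)


# Compare how many of 8 competing CPUs get served with 1, 2 and 4 banks.
for banks in (1, 2, 4):
    print(banks, "bank(s):", expected_served(8, banks))
```

Under these assumptions a single bank serves one of the eight CPUs per cycle and the rest stall, while two banks serve almost two. Real workloads are not uniformly random, so treat this as intuition rather than a prediction.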
@MaulikParmar210 2 years ago
@Sandeep-cz7ls To keep it simple: in modern-day CPUs, or let's say CPU clusters, there's a memory controller inside each CPU cluster that makes requests on behalf of the physical CPU die. In NUMA there are multiple clusters, each acting on its own, so there are multiple access points through which different CPUs can reach different or the same memory banks. When two controllers try to access the same bank and location, it is a parallel access and will cause a lot of data inconsistencies when different CPUs read and write at the same time, unless it is handled at the software level, so that software is aware of such an architecture. The OS knows the memory space, and the kernel is generally responsible for making sure each CPU's request is translated in the proper order and to the proper physical location, using translation tables or other hardware means that speed up this process, depending on what's available. In NUMA this is much more complex, as each node has to communicate and coordinate exactly what it needs. That's where the connecting fabric comes in, which provides the crucial functions to get data in and out of foreign clusters. Keep in mind that when we talk about software, it's mostly OS-level software and not consumer APIs, as consumer APIs abstract these traits away. Your software would never know or have to care whether it's running on 1 core, 4 cores, or 12 cores across 2 CPU sockets; in the eyes of userspace, resources are unified. If you want to optimise, then of course you can request that the system allocate memory near a resource. It's the job of the OS to maintain and abstract the hardware and allow controlled access via syscalls or driver APIs.
@as-qh1qq 2 years ago
Why does making the interconnect (distributed shared memory) super-fast not bring back the original problem that we were trying to solve: increased memory access collisions with increased CPUs? After all, if far-away CPUs can access memory in nearly the same time as the nearby ones, how is it any different from just one memory with all near and far CPUs connected to it?
@ssvis2 2 years ago
It probably would reintroduce the problem. However, I would suspect there is some trickery under the hood of the OS working with the hardware to optimize data locality to keep the data on the "near" memory for any core. It's possible that part of it is memory mapping in the data interconnect so that memory on the "far" chunk could still be viewed as local to a core, and the super fast interconnect effectively negates the performance penalty that a traditional NUMA system would have.
@samuie2 2 years ago
I agree that it was not super clear in the video. I think you could still have that issue. However, it happens half as often since you have 2 banks of memory.
@davidgillies620 2 years ago
I would guess that it means you don't _have_ to tune data affinity (which makes development/deployment easier and therefore cheaper) but you _can_ if you want (which gives you the benefits of an optimised NUMA configuration).
@ssvis2 2 years ago
@davidgillies620 I'm thinking the same thing. By optimizing specific parts of the system, Apple has theoretically designed something that will perform really well in 99% of use cases. There's always more performance to squeeze out, but with severely diminishing returns.
@gajbooks 2 years ago
UltraFusion is really just a memory... Fusion. Their memory gets twice as fast since they have twice as many banks, they just need a way to combine the M1 chips so that both of them can use the other's memory at high speeds. There was probably some tradeoff with the memory controller or packaging which made them need 2x64 rather than having external 128 GB. I imagine their real Mac Pro replacement will have external memory and GPU.
@TheMrKeksLp 2 years ago
Modern CPUs are only Harvard architectures in the most pedantic classification. Instructions are still kept in main memory; they just have separate level 1 instruction and data caches. Even level 2 and 3 are shared...
@doctorpex6862 2 years ago
Netflix gains most of speed by "video is not available in your country"
@markholm7050 2 years ago
Can one still purchase green lined, perforated line printer paper or are you working off an old stock? That stuff was great for physics homework. Worked pretty well in line printers, too.
@sajukkhar 2 years ago
Dot matrix paper is still sold.
@rabidbigdog 2 years ago
I'm convinced there is a warehouse in Nottingham that is full of nothing but that tractor paper, just for Computerphile.
@davidgillies620 2 years ago
You can buy a couple of thousand feet of the green ruled stuff for about forty quid from any wholesale stationery supply store.
@arpanmajumdar617 2 years ago
I think they are still available at Dunder Mifflin.
@heisen9460 2 years ago
@arpanmajumdar617 lol
@SaiPhaniRam 2 years ago
Excellent presentation .. Simple and easy to understand 👏
@ipurelike 2 years ago
thanks for the technical explanation!
@petrilaakso7927 2 years ago
Excellent explanation of NUMA, excellent work🙏🏼
@user-cc8kb 2 years ago
Great explanation. Thanks!
@jaopredoramires 2 years ago
The camera and lighting on this one look incredible
@JJ-fq3dh 2 years ago
Great video, brings back memories of coding on an SGI Origin 2000 and IRIX
@shaneclk9854 2 years ago
Excellent video
@aipsong 2 years ago
Excellent, instructive video - thanks!
@grahmn886 2 years ago
Lesson of the day, Thanks as always Steve :)
@RegitYouTuber 2 years ago
Favourite bit of this was the chaotic side-angle crash zoom - really complements the desperate addition of “well of course it’s more complex than this, but” that seems necessary these days
@kuroexmachina 2 years ago
this channel is gold. always has been
@Derbauer 2 years ago
Nicely explained!
@salmiakki5638 2 years ago
*It's only the first two generations of Threadripper CPUs that have 2 NUMA nodes. The last one and both generations of Threadripper Pro have unified memory access.
@romevang 2 years ago
Threadripper 2990WX has 4 NUMA nodes. 2950X, I think, has 2.
@salmiakki5638 2 years ago
@romevang thanks, I thought I remembered it was the same throughout the range
@AL-vc9xc 2 years ago
Wow, very well and simply explained. Not in a math profession, but I did understand this quite well! Thank you!!
@OscarBerenguerPV 2 years ago
This was a great video
@KipIngram 2 months ago
It's worth noting that the PCI ports are usually also split into these two domains, so you want to take that into account as well.
@vernonthomas6554 2 years ago
Love your channel.
@danielsilva158 2 years ago
Would’ve been good to touch on how this memory system interfaces with the GPU!!
@qwertypnk9401 1 year ago
Nice, good job!
@sholinwright6621 2 years ago
Don’t you still have to write code to distribute the memory hits across the two memory banks, or else you just have the same multi-core stalling effect mentioned earlier? The speed-up was the ability to partition core memory fetches into two batches, preventing all of the cores stalling while trying to fetch from the same bank. Side note: I work on a radar with 11 CPU cards, each with an 88000 and 2 MB of local RAM, with the collection tied to 2 global memory cards with 8 MB of RAM. GRAM memory fetches are really expensive.
@iammakimadog 2 years ago
Thank you!
@Sierra-Whisky 2 years ago
What an excellent explanation! And what a coincidence too. I tried to explain NUMA and the potential performance hog on the exact same day this video was published but obviously my explanation was nowhere near as clear as this one. 🤣 Thanks! I'll share it with my colleagues.
@user-cx2bk6pm2f 2 years ago
Finally!! I understand NUMA.. thank you !
@bentationfunkiloglio 2 years ago
Quite informative.
@tomdchi12 2 years ago
Doesn't Apple provide the compilers (and IDE), so couldn't they be baking in the modifications to the code that are required to manage the non-uniformness of memory access times? (Regardless, early benchmarks indicate that performance is scaling only a little short of linearly with the number of cores, so we can infer that memory access across the two halves of the "fused" CPU isn't creating major delays.)
@Hooorse 1 year ago
Thank you
@bosco4533 1 year ago
I love this channel. /message.
@gorunmain 2 years ago
This is great!
@nameunknown007 1 year ago
Love you man!
@itsMunchkin 2 years ago
Waw! Powerfully explained.
@dembro27 2 years ago
Cool stuff. But now I have "Numa Numa" in my head...
@kanishk9490 2 years ago
Yeah me too.
@JohnnyWednesday 2 years ago
Thank you kindly Dr. Bagley for sharing your knowledge with us. I'm quite surprised that Intel and AMD have not yet pushed for on-die memory given the M1's impressive demonstration
@SimonVaIe 2 years ago
It does have some negative consequences. More expensive to produce, not expandable, and if one thing breaks the whole thing is broken. I also don't know how much expertise would be required in RAM design/production (keep in mind that Apple is far bigger than Intel, which is far bigger than AMD), seeing there is a very well established ecosystem of memory manufacturers (they do have quite extensive cache systems on their CPUs already; I don't know how well that translates). And not every task profits as much from faster RAM. No idea if those are major reasons for AMD and Intel, but like for everything else it's just a matter of finding what best fits a job.
@dotted1337 2 years ago
On-die RAM is rather limiting, so it won't really work well for either AMD or Intel to make such a product: that kind of RAM is much too slow, in terms of both bandwidth and latency, for use as a cache, and if used as RAM you'd have the same problem this video is talking about. But Intel had the i7-5775C back in 2015 with 128MB of eDRAM for the on-board GPU, which was also used as an L4 cache, and Intel's upcoming Sapphire Rapids Xeon will have a version with 64GB of on-package HBM2E with a bandwidth of well over 1TB per second. And finally you have AMD with their V-Cache, supposedly having a bandwidth of about 2TB per second. tl;dr Apple can do on-die memory because they know exactly who their customers are and can make almost tailor-made SoCs for them, whereas AMD and Intel have customers much too diverse to make on-die memory viable.
@JohnnyWednesday 2 years ago
@dotted1337 - Thank you for your detailed reply. I was unaware of the i7-5775C; that smells like it could have been designed for use in a console, given the perceived similarity to previous Xbox memory layouts. It is my understanding that a large part of the M1's 'boost' above other ARM designs is the lower-latency access to system memory? Perhaps naive, but if such performance can be gained for an ARM chip, then should not a similar ratio of performance be seen with a similarly designed x86 chip? With ultra-fast streaming devices and multi-channel paradigms like the PS5's SSD controller, could we not see a slowing of average memory capacity for users? Perhaps the time for a fixed 16 GB of memory on a CPU is now? Especially given that console generations lock game engine technology advancements for years at a time?
@harshpatel9020 2 years ago
I think this is because they use DDR in their desktop models (not laptops, because laptops come with both) and not LPDDR as used in Apple's M1 line-up. In mobile processors, where both DDR and LPDDR are used, the RAM is mounted on the PCB (soldered to the motherboard, not on the die itself as you said is the case with Apple). Note: many things I said may turn out to be wrong, so it would be better to cross-check things before drawing any conclusion. I would be happy to know where I am wrong and learn something new. Thank you.
@mytech6779 2 years ago
On-die memory is called L1 cache; L2 and L3 are often placed on die as well. In fact, over 80% of late-generation CPU silicon area is taken up by on-die memory. (NB4: yes, the 386 had off-die L1, but it was 1986.)
@Yoda2000ful 2 years ago
Amazing, I wish I had a teacher like you on the microprocessor classes at my degree❤️
@Benny-tb3ci 2 years ago
We, the people in chemistry and any other science that relies heavily on chemistry, have a very nice phrase for these kinds of things. It's called the "rate-limiting step" (in a chain of reactions).
@SimonJentzschX7 2 years ago
Great video. I learned something new! Just one question: could the operating system optimize my code when executing? When I allocate memory, the OS should know which CPU the process is running on and allocate the memory in RAM that is faster to access. This way the code does not need to change, just the OS.
@mr_waffles_the_dog 2 years ago
OSes already tend to do this :D The problem is what happens when you have multithreaded code (e.g. running on multiple cores/CPUs at once): there is no one ideal block of memory for the OS to allocate from. The Apple claim is that their system is non-NUMA, or at least sufficiently fast to be indistinguishable, so developers don't have to rearchitect things to maximize performance.
@kriptofinans2864 2 years ago
Very clear thx :)
@wile123456 2 years ago
Maybe you've done it before, but I would love a video explaining video games vs rendering/productivity workloads. Games get a big performance boost with more cache: the 5800X3D 8-core CPU increased performance a lot by more than doubling level 3 cache with 3D stacking. But why does it mostly benefit only games and not other workloads?
@circuitgamer7759 2 years ago
Video idea (because I don't know where to look for this) - some of the finer details of caching implementation. I understand the idea behind caching, and the structure behind it, but not how it's actually implemented. I want to learn the actual control logic for reading/writing cache lines, and when and how it gets updated to/from RAM or a higher level cache. Do the CPU cores control the caches directly, or is there some control logic for each cache that isn't a part of a specific core? I think it would be an interesting video, but if there's already one that exists that I missed, can someone reply with a link? I've only been able to find high-level explanations so far.
@IceMetalPunk 2 years ago
Apple: "M1 ULTRA FUSION!" Reality: "It's a fast wire junction."
@G5rry 2 years ago
Reality: No, it's a bit more than that.
@RunForPeace-hk1cu 2 years ago
If it’s so easy everyone would make 10TB/s interconnect 😂 It’s a lot more complex than that.
@giornikitop5373 2 years ago
@RunForPeace-hk1cu it IS actually fairly straightforward to make a 10TB/s interconnect, but the cost is beyond crazy. Besides, you need a CPU of such power to take advantage of it, so the cost makes even less sense. So the reason is not that they cannot make it; the reason is that they don't need to, at least not yet.
@mysteriousm1 2 years ago
Was there an earthquake during filming or why is it so shaky?
@henrikjensen3278 2 years ago
Good explanation, but I would like some explanation about write/read, i.e. two threads reading and writing to the same memory location. This would be easy enough to handle between the two sides, but what about two CPUs on the same side, each with their own cache? It sounds like a lot of circuitry to handle that. Are there some smart solutions?
@ClarkCox 2 years ago
That is indeed a problem that must be contended with. Look up "cache coherence"
@kirtanmusica1999 2 years ago
Namaskar, thank you for the education, thank you for the light
@SproutyPottedPlant 2 years ago
That was great! When you showed the bus arbiter it reminded me of the Sega Mega Drive! It’s got one of those??
@roryskyee 2 years ago
very technical
@bumbixp 2 years ago
Doesn't the OS scheduler largely handle this? Even if you make a single threaded app, Windows will move it around on different cores but it stays within the same NUMA node.
@Pyroblaster1 2 years ago
Let's say you allocate a buffer and load data into memory in a single thread and then start many threads to process that data, which is a perfectly reasonable and usual way to do things with uniform memory access. Then if you saturate the system with threads, half or more of the threads will run on NUMA nodes that are different from where the data buffer was allocated, incurring the longer access times. You have to explicitly handle the allocation and data loading so that the data is distributed in a way that the threads processing each part of the data are on the same NUMA node as the data they are processing.
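The scenario above can be illustrated with a small bookkeeping model. It does not touch real NUMA hardware; it only counts which pages would be remote. It assumes a first-touch style placement (a page lands on the node of the thread that first writes it), two nodes, and workers that each process a contiguous slice of the buffer. All names and sizes here are made up for the sketch.

```python
def remote_fraction(page_node, worker_node):
    """Share of pages whose worker runs on a different node than the page."""
    remote = sum(1 for p, w in zip(page_node, worker_node) if p != w)
    return remote / len(page_node)


N_PAGES, N_NODES = 1024, 2  # hypothetical buffer size and node count

# Node of the worker that processes page i: the first half of the buffer
# is handled on node 0, the second half on node 1.
worker_node = [i * N_NODES // N_PAGES for i in range(N_PAGES)]

# Case 1: a single loader thread on node 0 fills the whole buffer, so under
# the first-touch assumption every page ends up on node 0.
single_loader = [0] * N_PAGES

# Case 2: every worker initialises its own slice, so each page ends up on
# the node where it will later be processed.
per_worker_init = list(worker_node)

print("single loader  :", remote_fraction(single_loader, worker_node))
print("per-worker init:", remote_fraction(per_worker_init, worker_node))
```

In this model the single-loader layout leaves half of the pages remote to the thread that processes them, and per-worker initialisation leaves none. That is the "half or more of the threads" effect described in the comment.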
@Piktogrammdd1234 2 years ago
Yes and no. There are mitigations on every level to compensate for problems, but every solution falls short in comparison to an idealistic system with endless memory, zero latency, and no collisions. OS schedulers try to localize data and the corresponding processes, but limits are still there. It becomes problematic every time processes on a node need more memory than is available locally, or processes are relocated to other nodes.
@ivanskyttejrgensen7464 2 years ago
The OS tries to handle this, but it's not perfect. E.g. last time I dealt with this, the OS tried to serve memory allocations from the nearest memory, but wouldn't move it around afterwards. So we ended up using processor sets to direct processes to be started on the "right" part of the CPUs, so the subsequent memory allocations could all be served from the local memory. That gave a 10-15% speedup compared to leaving it to the OS to figure things out.
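The comment does not say which operating system or tool was used for the processor sets. As a loosely analogous sketch, Python exposes CPU affinity through `os.sched_setaffinity`, which exists only on some platforms (Linux, for example). The helper name `pin_to_cpus` is made up here, and which CPU IDs belong to which NUMA node is something you would have to look up on the actual machine.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the current process to `cpus` and return the resulting set.

    Relies on os.sched_setaffinity, which is only available on some
    platforms; returns None where it is missing.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    # Harmless demo: re-apply the set of CPUs we are already allowed to use.
    # A real deployment would pass the CPU IDs of one NUMA node instead.
    allowed = os.sched_getaffinity(0)
    print(pin_to_cpus(allowed))
```

Pinning only controls where the process runs. Whether later allocations are then served from local memory depends on the operating system's memory placement policy, as the comment describes.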
@kelvinluk9121 2 years ago
Is it possible to address the RAM access conflict issue between different CPUs by introducing more memory channels?
@tcornell05 1 year ago
This might be the most informative video I've come across in years on YouTube. You have an amazing way of articulating topics like this to the ADHD & dyslexic programming community, like myself xD. Now I'm dying for a follow-up on how exactly they managed to make the distributed shared memory link so fast. Any resources you recommend?
@X_Baron 2 years ago
Ultra Fusion is basically Blast Processing, but more extreme and rad.
@tomahzo 2 years ago
Nice video! One question would be how much of this is done purely in hardware and how much is informed by the compilers, system frameworks and the OS as a whole. Apple has the advantage that they build the hardware and the full OS stack, whereas players like Intel and AMD cannot pick and choose which OSes they want to support, so whatever they do must be fully realized in hardware (although the OS vendors do need to support the hardware features that they offer). So does this mean that Apple uses some system software tricks to accelerate the interconnect and to reduce the latency, maybe through the way that the CPUs and their associated memory are partitioned and how threads are scheduled across the cores to minimize traffic through the interconnect?
@DalasYoo 2 years ago
CCIX Hooray!
@andredejager3637 2 years ago
wow thanks 😊
@edmondhung6097 2 years ago
But what is more important in this NUMA case: latency or bandwidth? And to push performance to the absolute limit, is it still better to use local memory instead of remote, even though Apple promoted the interconnect as having more bandwidth than its memory bandwidth?
@RAJATTHEPAGAL 2 years ago
Another hypothesis is Apple's Rosetta layer possibly working to translate instructions to accommodate the memory layout. Perhaps tapping in between the OS kernel-level calls and the application layer to translate the memory allocation and instruction placement so they are co-located in the same memory. I mean, Rosetta emulation is fast; I won't be surprised if they use it for this purpose. It won't be a silver bullet, but a bullet they may add for solving the memory placement issue.
@magicmark3309 2 years ago
I wouldn't think so. Rosetta only installs once you install software that can’t natively run on M1. I think that’d be adding too much overhead to an already somewhat costly translation layer. Although I’ve seen it really depends on the particular software. It also helps that Apple has a very large piggy bank for their R&D and that they plan everything so far off. Hence why iPhones are just now getting high refresh rates. Hopefully this will give new life to competition in the market.
@newburypi 2 years ago
Think I missed something here. Totally got the "was slow but Apple made it fast." However I think there's a promise of "won't need to change the software." The NUMA method requires knowledge of which memory block has the desired data. Hence, a change to software. So... did they also build a way to hide the fact of two memory blocks?
@elliott8175 2 years ago
The reason NUMA systems usually require the software developers to be aware of the positioning of CPUs and memory is because of the slower speeds when fetching data from memory that is farther away. However, the new M1 chip claims to make fetching data fast enough for the worst-case RAM position to still not cause any slow-down. I assume this means that the difference in time to fetch memory that is close, compared to memory that is far away, is less than a clock cycle. So from the core's point-of-view they have the same latency.
@newburypi 2 years ago
@elliott8175 great. Thanks for the clarification. Thought I missed something.
@RunForPeace-hk1cu 2 years ago
@elliott8175 the “trick” is literally the hardest part that no one could solve 😂
@Xiaomi_Global 2 years ago
How about the same architecture but different fab interconnect process? Does it affect performance?
@asmerhamidali9679 2 years ago
Please make some videos on RISC-V. Lately it has been a hot topic.
@michaellatta 2 years ago
I would guess ram attached to each die and cache is on that die. Interconnect used for off-die access to the other die’s cache/ram.
@jaffarbh 2 years ago
Things get more complicated as we use cloud based virtual machines and we have no idea (often) about the underlying hardware and architecture. If I recall, VMware hypervisor dynamically reallocates memory blocks to optimise for more (localised) access so that software developers don't need to worry about it.
@marklonergan3898 2 years ago
Maybe I'm not understanding the problem correctly, but couldn't you just have a rudimentary controller sitting between the 2 that uses the most-significant bit of the address to determine which RAM chip has the data? That way, by having the controller between the chips as the central access point, all queries would take the same amount of time to fetch the data. By having this logic at hardware level you would have minimal latency added. I know this would only work on chips that are the same size, but you could combine composites with singles (i.e. 2x 32s connected with a controller could be combined with an actual 64 with a controller).
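The address-decoding part of this idea is easy to sketch. The example below assumes a tiny, made-up 4-bit address space split evenly across two equally sized banks; `ADDR_BITS` and `decode` are hypothetical names. It shows only how the top bit selects a bank, and says nothing about how long the selected bank takes to answer.

```python
ADDR_BITS = 4  # hypothetical tiny address space: 16 locations in total


def decode(addr: int):
    """Split an address into (bank, offset) using the most significant bit."""
    bank = addr >> (ADDR_BITS - 1)                # top bit picks bank 0 or 1
    offset = addr & ((1 << (ADDR_BITS - 1)) - 1)  # remaining bits index into the bank
    return bank, offset


print(decode(0b0101))  # top bit is 0, so this address lives in bank 0
print(decode(0b1101))  # top bit is 1, so this address lives in bank 1
```

Both addresses map to the same offset inside different banks. Choosing the bank is cheap; whether the access time is uniform still depends on how far the requesting core is from that bank.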
@Addlibs 2 years ago
This suffers the same slowdown that results from physically separate RAM locations, close to individual groups of CPU cores but not as close to others. Even if the most significant bits picked the RAM module without any fancy chips in the way, fetching data from a CPU farther down the line is going to be generally slower, and it's easy to double or triple the tiny amount of time it takes to fetch data with computers this compact and fast; that is, 4 nanoseconds is twice as long as 2 nanoseconds -- both are incredibly fast though.
@katbryce 2 years ago
@Addlibs Remember that a 4GHz CPU goes through 4 clock cycles every nanosecond, and in a nanosecond light travels about 30cm. Electricity is slower, so any round trip of more than about 3cm isn't going to happen within a clock cycle.
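The distance argument can be worked through as plain arithmetic. The speed of light is a physical constant; the velocity factor of 0.5 used below is only an assumed round number for a signal on a board, and `path_per_cycle_cm` is a name invented for this sketch.

```python
C_CM_PER_NS = 29.98  # speed of light in vacuum, in cm per nanosecond


def path_per_cycle_cm(clock_ghz: float, velocity_factor: float = 1.0) -> float:
    """Total distance (cm) a signal can cover during one clock cycle.

    velocity_factor is the signal speed as a fraction of c; 1.0 means vacuum.
    """
    cycle_ns = 1.0 / clock_ghz  # a 4 GHz clock has a 0.25 ns cycle
    return C_CM_PER_NS * velocity_factor * cycle_ns


print(path_per_cycle_cm(4.0))       # light in vacuum
print(path_per_cycle_cm(4.0, 0.5))  # assumed signal speed of half of c
```

At 4 GHz that is roughly 7.5 cm of total path per cycle in vacuum and about 3.7 cm at the assumed half of c, which is the same ballpark as the figure in the comment. A round trip has to fit its outbound and return legs into that total.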
@torb-no 2 years ago
In the Fujitsu A64FX, is the CMG (Core Memory Group) like one of these groups talked about in the video? So if you’re on one node in one of them, trying to get data from memory connected to another CMG will be slower?
@radutopor8389 2 years ago
I still don't get why splitting the RAM in two wouldn't cause the same collisions problem with the high number of CPUs, given they effectively still share just one bus, albeit connected by some black box in the middle.
@YeOldeTraveller 2 years ago
Because the two NUMA regions are separate, any access in one region does not impact access in another region. Even without coding for it, you reduce the likelihood of collision.
@1idd0kun 2 years ago
No matter how fast the interconnect is, it's never gonna behave like a UMA system. If a core in die 1 tries to access the memory pool attached to die 2, there will be a latency penalty. We won't know how big that latency penalty is and how much of an impact in performance will have until the system is properly tested. I'm hoping Anandtech will test it since they usually do memory latency tests.
@bobo-cc1xw 2 years ago
Ian Cutress, formerly of AnandTech, said above 5 to 7 ns for just the interconnect vs 54 ns total. So call it 15 percent more latency.
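The "call it 15 percent" estimate can be checked against the figures quoted earlier in the thread, where the interconnect was said to add 5-8 ns on top of a 54 ns worst case. The sketch below only redoes that division; the helper name is invented and the inputs are the commenters' figures, not measurements of the M1 Ultra.

```python
def added_latency_pct(base_ns: float, extra_ns: float) -> float:
    """Extra latency expressed as a percentage of the base latency."""
    return 100.0 * extra_ns / base_ns


# Figures quoted in the thread: 54 ns base, 5 to 8 ns added by the link.
for extra in (5, 8):
    print(extra, "ns extra:", round(added_latency_pct(54, extra), 1), "%")
```

With those inputs the overhead lands between roughly 9 and 15 percent, so 15 percent corresponds to the upper end of the quoted range.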
@steve1978ger
@steve1978ger Жыл бұрын
If I were to guess how they did this, I'd say they've made their memory bus expandable in the first place, like having an extra bit on the address bus etc.
@genhen
@genhen 2 жыл бұрын
I've always wondered if we accessed more than one NUMA nodes worth of memory, how does the memory get chunked up? Take half and half? Take most from one? Is it hardware dependent? Software/OS dependent?
@katbryce
@katbryce 2 жыл бұрын
On my Threadripper motherboard, there is the CPU, and either side of it, there are four memory slots for a total of 8. The 4 slots on one side are one NUMA node, and the 4 slots on the other side are the other NUMA node.
@PoseidonDiver
@PoseidonDiver 2 жыл бұрын
Also, there is no true virtual to physical CPU affinity. And the hypervisor generally allocates the compute to the VM as needed, when running performance graphs you can see big spike across the sharing CPUs when its allocating compute from another node. (hope that actually answers your question :p )
@sevilnatas
@sevilnatas 2 жыл бұрын
Does Computerphile often use greenbar paper for their illustrations, because they are still using greenbar a lot, so it is handy, or is it because they don't use it anymore, so they have a bunch of it sitting around, unused, so they might as well use it for illustrations?
@debojitmandal8670 2 years ago
Wait, but Apple isn't using distributed shared memory like you mentioned. Rather, a CPU in one block can access the memory of the other CPU block directly, without going through the distributed shared memory lane; at least that's what I understood from their presentation. There is no middleman like the distributed shared memory lane you mentioned. Please correct me if I'm wrong.
@5urg3x 2 years ago
I very clearly remember the days of dual socket (like multiple physical CPUs with their own memory) workstations. It looked cool on paper, but in the real world, it usually didn't work out very well. Many times, even with software optimizations, it was more efficient (and simpler logistically) to just use one physical processor, rather than to attempt to have them both working together on the same task or set of tasks, and having to swap data in and out of cache and memory, etc. For servers, it could work, but most workstation workloads just aren't going to benefit from that type of an architecture.
@AntiWanted 2 years ago
Nice
@jurabondarchook2494 2 years ago
Hmmm. But if you make the distributed shared memory system super fast, you end up with the same problem as at the beginning. When the distributed shared memory system needs to access memory, the CPUs attached to that memory have to wait, don't they? So the probability of collision increases again.
@JCBOOMog 2 years ago
Hi steve
@BR-lx7py 2 years ago
Can the operating system take care of always allocating memory from the block that is closest to where the requesting process is running? I know it's not perfect, but it would work 90% of the time.
@moritzhedtke8139 2 years ago
Linux actually does as far as I know
@ssvis2 2 years ago
To a certain extent, yes, the OS can. However, to do that effectively, it needs some information about the memory requirements and usage patterns of a process. Some can be gleaned from the raw byte code, especially if there are hints placed by the programmers, but a lot will come from actually running the process, then dynamically remapping and moving memory as needed. It'll work better for long-running processes, but is by no means optimal. That's why most super high performance programs, such as many video games, manually set CPU core affinity and use custom memory allocators to get direct control over memory locality. They'll even go as far as detecting which cores are fast and which are slow, and prioritizing from there.
@shunyaatma 2 years ago
Yes, the Linux kernel can take care of this. The default memory policy (MPOL_DEFAULT) makes the page allocator always try to allocate memory from the local node but if that's not possible, it uses a different node. Over time, even if pages get scattered across NUMA nodes, Automatic NUMA Balancing will either try to move the pages to the node from where they were accessed the most or try to move the program itself to run on a CPU that is close to the memory that it accesses the most.
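The "local node first, otherwise another node" behaviour described above can be sketched with a toy model. This is an illustration only and not the kernel's page allocator: the `Node` class, the `allocate_page` function and the page counts are all hypothetical, and page migration (Automatic NUMA Balancing) is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A toy NUMA node with a fixed number of free pages (hypothetical model)."""
    node_id: int
    free_pages: int


def allocate_page(nodes: list[Node], local_id: int) -> int:
    """Prefer the local node; fall back to another node that has free pages.

    Returns the id of the node the page came from, or raises MemoryError.
    """
    # Try the local node first, then the remaining nodes in id order.
    order = sorted(nodes, key=lambda n: (n.node_id != local_id, n.node_id))
    for node in order:
        if node.free_pages > 0:
            node.free_pages -= 1
            return node.node_id
    raise MemoryError("no free pages on any node")


nodes = [Node(0, free_pages=2), Node(1, free_pages=4)]
# A task running on node 0 asks for three pages.
placements = [allocate_page(nodes, local_id=0) for _ in range(3)]
print(placements)  # [0, 0, 1]: two local pages, then a fallback to node 1
```

Once node 0 runs out, the third page lands on node 1, which is the scattering that a balancing pass would later try to undo.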
@jfmezei 2 years ago
Great to find someone who remembers NUMA! BTW, you forgot to deal with cache coherence: core 1 modifying contents at a memory location that is also in core 2's cache. In the 1990s, Digital tried to scale its Alpha computers to many cores with its Wildfire-class machines. They found that 4 cores was the most the memory controller could handle before the performance gains stopped being interesting. So they built the Wildfires from 4-CPU "QBB" boards, connected by what Digital called a switch. NUMA access between these QBBs was atrocious. This was dealt with at the operating system level, less so at the application level. You could pre-load shareable images onto a specific QBB and then launch the processes that use them on that QBB, so they would use local memory for shareable images etc. But this was nowhere near enough. Digital then worked on the next-generation Alpha, the EV7, which was delayed as long as possible because Compaq/HP, who had bought Digital, didn't want the EV7 to beat the pants off the Intel Itanium heat generator. The EV7 introduced a totally new memory controller that remained state of the art beyond the death of Alpha. HP donated Alpha IP to Intel, which used it for its CSI interconnect (later called QuickPath) and which evolved from there. Ex-Alpha engineers went to AMD, who developed their own version, and many ex-Alpha engineers formed P.A. Semi, which was purchased by Apple to create its own ARM chips. The EV7 had coherent cache (and I believe only IBM's Power had this until AMD matched it). Intel's QuickPath did not implement coherent cache initially (despite having all the IP from DEC). If you google for Alpha Wildfire NUMA, you will find a result "Optimizing for Performance on Alpha Systems - Semantic..." by Norm Lastovica. It provides some then-current memory access figures showing the differences between direct and NUMA accesses in the Wildfires. Page 26 also shows the EV7 memory architecture as a fabric.
(The 21364 is the EV7 CPU; the first generation was the 21064.) Each CPU controlled a part of RAM. But because CPU 1 could request memory from CPU 2 at the same time as CPU 3 requested from CPU 4, CPU 5 from CPU 6 and so on, it ended up having a huge performance advantage when scaling the number of cores. There was also the issue of CPU speed vs memory speed. Alpha came to surpass memory speed easily, hence the 4-core limit Digital found in the 1990s. But when you increase memory speed (and it has increased tremendously since then), you can increase the number of cores that have direct access (especially in the last little while, when "Moore's Law" was more about adding cores than making each core faster). Before their demise, Digital engineers would present at DECUS conferences and provide a lot of information about Alpha advancements and how they improved things. It is a real shame that Apple hides all the real information and only provides marketing gobbledygook that is useless.
@andybaldman 2 years ago
Nobody cares, man.
@RogerBarraud 1 year ago
@@andybaldman You are wrong on the Internet.
@andybaldman 1 year ago
@@RogerBarraud Nope you are
@TheOisannNetwork 2 years ago
Nice lights 😉
@peterhindes56 1 year ago
Why have a memory interconnect at all then? Unless this was not intended to solve the problem mentioned about memory access getting clogged up.
@caffedinator5584 2 years ago
My naive understanding of CPU architecture leads me to believe that the core-to-core memory interconnect is less of a problem than GPU core kernel/instruction execution. Do you have any insight into that?
@hoagy_ytfc 2 years ago
Given what they claim for "unified memory", the GPUs should be factored into this description, IMO
@centerfield6339 2 years ago
I don't really understand this - if the NUMA architecture lets you access the other memory as fast as local CPUs, then doesn't the original contention problem become an issue again? I thought that's what the video would end with, given it was teed up like that.
@autohmae 2 years ago
I wonder if Linux scheduler already has a variable for the latency, so no new code is needed. My guess would be yes.
@johongo 2 years ago
I want to learn more about this stuff, but it seems very distant, even as someone who programs for work. Any advice?
@MrPBJTIME12 2 years ago
Computer Organization & Architecture - William Stallings
@raeraeraeth 2 years ago
That camera roll is making me heave
@Travisharger 2 years ago
I love how he is using paper that looks like it came from my windows 3.1 computer back in 1994.
@TomekSw 2 years ago
Great video. Came 10 years late for me. :( :)
@dustinmorrison6315 2 years ago
Hopefully my programs are not fetching instructions from RAM often enough for it to matter. Hopefully they're somewhere in the L1i,2,3,4 caches.
@jbf81tb 2 years ago
I would like to know what the architecture of the GPU is like after seeing this. I believe GPUs have thousands of cores, but I don't think I see numbers for cache in their marketing materials, just memory, cores, and clock speeds.
@TDRinfinity 2 years ago
I don't know about GPUs with 100% certainty, but I design SOCs with hundreds of cores and we still have individual caches for each core, as well as shared cluster caches
@Edekje 2 years ago
It's quite interesting how GPUs work actually. They function in a completely different way to CPUs, preventing this problem. Each core in a multi-core CPU system functions as a completely independent entity, accessing whatever memory it needs to complete its calculations. Different cores working on separate tasks can therefore end up accessing the same memory, thereby slowing down each other's progress. The gist of what happens in a GPU is that the entire GPU is dedicated to doing just one single task. Each one of its cores executes exactly the same task, piece of code, in lockstep. The key difference here is that each core performs the same actions, but on different pieces of memory. Often these pieces of memory are adjacent. So a (sensible) GPU program will never have different cores trying to lock the same piece of memory simultaneously. The GPU has chopped up one big task into 1000s of equal bite-sized chunks. Hope that explanation helps!
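To make the "same code, different adjacent pieces of memory" idea concrete, here is a rough sketch in plain Python. It is not GPU code, it runs sequentially, and the `kernel` and `split` names are made up for the example; it only shows how one task is cut into chunks that never overlap, so no two "cores" would contend for the same element.

```python
def kernel(chunk: list[float]) -> list[float]:
    """The single piece of code every 'core' runs: double each element."""
    return [2.0 * x for x in chunk]


def split(data: list[float], n_cores: int) -> list[list[float]]:
    """Cut data into at most n_cores adjacent, non-overlapping chunks."""
    size = max(1, -(-len(data) // n_cores))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


data = [float(i) for i in range(8)]
chunks = split(data, n_cores=4)
# Every chunk goes through the same kernel, and no element is in two chunks.
result = [y for chunk in chunks for y in kernel(chunk)]
print(chunks)
print(result)
```

On a real GPU the chunks would be processed at the same time by hardware threads rather than one after another in a loop.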
@TDRinfinity 2 years ago
@@Edekje so like vector/data level parallelism vs thread level parallelism?
@TDRinfinity 2 years ago
@@Edekje like is a GPU usually just executing a single thread of vector instructions at a time, or can it split across clusters of execution units to execute different threads? I have really no GPU experience
@jbf81tb 2 years ago
@@Edekje Thank you, that is a helpful explanation. Is there some synchronizer that keeps the cores on task? I'm thinking like Tom, I know the GPU is basically a matrix multiplication fiend. Is that because it can easily distribute all the multiplications to a bunch of cores and then there's a synchronizer that can grab all those products and sum them together in the appropriate way to return the expected matrix? Rereading your answer, I'm wondering if it's like a pass-the-bucket situation. Like a core takes a bucket of memory from the core "to its left", it does some operation on that memory, and then hands it off to the core "on its right", looking left for its next bucket. Or is it more like "i've got 10 buckets and I need 10 cores to work on them", hands those off, and then next "I've got 37 buckets," hands them off, etc.
@vladomaimun 2 years ago
Does application software need to be NUMA-aware, or does the OS kernel handle everything NUMA related?
@JamesClarkUK 2 years ago
The OS could do scheduling to keep your application on one NUMA node. You can use numactl on Linux to tell the kernel what you want to happen.
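For anyone who wants to see what nodes a Linux machine exposes before reaching for numactl, a small sketch follows. It assumes the usual sysfs layout (`/sys/devices/system/node/nodeN/cpulist`) and returns an empty result on systems without it, such as macOS; the helper names are my own.

```python
import glob
import os


def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as '0-3,8-11' into CPU numbers."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes() -> dict[int, list[int]]:
    """Map NUMA node id -> CPUs, read from sysfs; empty dict if unavailable."""
    nodes: dict[int, list[int]] = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpulist"):
        node_id = int(os.path.basename(os.path.dirname(path))[len("node"):])
        with open(path) as f:
            nodes[node_id] = parse_cpulist(f.read())
    return nodes


print(numa_nodes())
```

With the node ids in hand, something like `numactl --cpunodebind=0 --membind=0 ./app` (where `./app` is a placeholder) asks the kernel to keep both the threads and the memory of a program on node 0.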
@RunForPeace-hk1cu 2 years ago
The whole point is that it's a hardware implementation and no software needs to be changed.
@bmitch3020 2 years ago
Is this at least part of the reason motherboard instructions specify which slots should be used for various numbers of RAM chips?
@R3BootYourMind 2 years ago
No, the slots are numbered because of how RAM is accessed in parallel. Dual-channel memory is usually the maximum consumer CPUs can handle, and the channels are electrically wired to work best with certain slots. Using the "wrong" slots would make two RAM sticks run either in single-channel mode, or in dual-channel mode but over the longer traces. The slots with the slightly longer traces are the non-preferred ones that get used when 4 sticks are installed, and they can affect memory overclocking results.
@qm3ster 2 years ago
No CPU gets data "before it needs it" :v Going to main memory is really, REALLY slow (compared to anything else CPUs spend time doing these days). So, are any cache layers shared between the chiplets?
@dos350 2 years ago
numa baby!
@soundcheck6885 2 years ago
If you have a massive high-speed, low-latency die-to-die interconnect between two dies, accessing the memory on the other die could have a latency penalty of only 10-20 ns. In a system using a multi-level cache hierarchy, having 10-20% extra latency for remote memory access is irrelevant for most applications. What would be more interesting is whether Apple can extend the same interconnect architecture to higher numbers of CPU clusters (e.g. 4 or 8), and how much extra latency would be involved for memory access in those cases. One interesting option to achieve this may be stacking dies vertically in addition to the planar configuration in the M1 Ultra.
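The reason a cache hierarchy hides most of a remote-access penalty can be shown with a back-of-the-envelope average access time. This is a sketch with made-up numbers: the 99% hit rate, the 5 ns cache latency and the 100 ns local memory latency are assumptions for illustration, not measurements of any chip, and the hierarchy is collapsed into a single cache level.

```python
def avg_access_ns(hit_rate: float, cache_ns: float, memory_ns: float) -> float:
    """Average access time for one cache level in front of main memory."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns


# Hypothetical figures, chosen only to illustrate the effect.
HIT_RATE = 0.99
CACHE_NS = 5.0
LOCAL_NS = 100.0
REMOTE_NS = LOCAL_NS * 1.20  # remote memory assumed 20% slower

local = avg_access_ns(HIT_RATE, CACHE_NS, LOCAL_NS)
remote = avg_access_ns(HIT_RATE, CACHE_NS, REMOTE_NS)
print(f"all-local average:  {local:.2f} ns")
print(f"all-remote average: {remote:.2f} ns")
print(f"slowdown: {remote / local - 1.0:.1%}")
```

Only the misses pay the extra latency, so the average moves by far less than the raw 20 percent; a workload with a poor hit rate would feel much more of it.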
@TheGTP1995 2 years ago
To be fair, most programmers wouldn't have had to care about how the memory worked anyway because the compiler would have done the job for them. This spared Apple the cost of writing new optimization code for the compiler, at the cost of more hardware engineering
@RunForPeace-hk1cu 2 years ago
You’ve never written kernel drivers have you? 😂
@K4nj 2 years ago
@@RunForPeace-hk1cu he clearly hasn't worked within a c++ development environment