This entry is pure gold. Please make more vids where the latest tech is a jumping off point for the main topic.
@Stopinvadingmyhardware2 жыл бұрын
Where did I apply for that?
@oskrm2 жыл бұрын
That's the thing, this is not the latest tech.
@nezbrun8722 жыл бұрын
NUMA's not new, it's been a facet of multi socket Xeon systems for many years for example, and other architectures too before that. The battle has always been to make the interconnect interfaces (QPI/UPI in Intel speak) as quick as possible to maximise performance. Software like RDBMSs are NUMA aware to optimise workload across sockets (and hence memory domains).
@darkidz242 жыл бұрын
It could really take this channel to the next level!! Explaining modern day tech
@SilentlyContinue2 жыл бұрын
Yes! Helps with understanding real world application.
@TechTechPotato2 жыл бұрын
Intel's EMIB, similar to UltraFusion, in Sapphire Rapids adds additional latency of 5-8 nanoseconds. This makes the core-to-core latency go from 54 ns worst case to 70 ns worst case. Apple's situation is similar, with similar bandwidth per connection. We expect the latency to be an additional 5-8 nanoseconds also. UltraFusion uses TSMC's InFO_LSI manufacturing.
@Joseph_Roffey2 жыл бұрын
But the difference is one is called “random string of letters” and the other is called “Ultra Fusion” 😍
@eddyecho2 жыл бұрын
@@Joseph_Roffey huh? More like one is a "stupid marketing name that really doesn't describe the underlying mechanism" and the other is called "embedded multi-die interconnect bridge"
@landspide2 жыл бұрын
@@Joseph_Roffey And begins with "We call this..." and is filled with "... only at Apple can we ..."
@shunyaatma2 жыл бұрын
Any numbers for AMD (Zen 2 and 3) 2-socket systems with and without xGMI cables?
@egor1g2 жыл бұрын
yeah, but it is ARM vs x86, 256 channel memory against 6 and also efficiency cores, also video memory... so not really the same!
@NinjaAdorable Жыл бұрын
This has been one of the most intuitive and elegant explanations for NUMA I have ever heard!! Kudos
@prla54002 жыл бұрын
Back to you, Steve
@markholm70502 жыл бұрын
Can one still purchase green lined, perforated line printer paper or are you working off an old stock? That stuff was great for physics homework. Worked pretty well in line printers, too.
@sajukkhar2 жыл бұрын
Dot matrix paper is still sold.
@rabidbigdog2 жыл бұрын
I'm convinced there is a warehouse in Nottingham that is full of nothing but that tractor paper, just for Computerphile.
@davidgillies6202 жыл бұрын
You can buy a couple of thousand feet of the green ruled stuff for about forty quid from any wholesale stationery supply store.
@arpanmajumdar6172 жыл бұрын
I think they are still available at Dunder Mifflin.
@heisen94602 жыл бұрын
@@arpanmajumdar617 lol
@paulledak2912 жыл бұрын
Nice explanation of how NUMA architecture is implemented. However, you stated that the reason for moving to this architecture is because as you add more and more cores, you increase the probability of memory collisions. But then you completely forgot to explain how having 2 memory banks reduces the probability of the memory collisions that you would still get as you add the more processors. It would seem to be the most essential element needed for this video which is completely missing. (Yes I understand that now there are 2 memory banks with twice the bus bandwidth but this is never explained. And there are different interleaved memory architectures which could increase the memory bandwidth without resorting to NUMA)
@bberakable2 жыл бұрын
Agree 100%
@mytech67792 жыл бұрын
It's not bandwidth that's at issue, simultaneous access is the issue; this allows the banks to be accessed in parallel. It's like using a network bridge to make two Ethernet subnets. Which I just realized is a really outdated reference, as nobody uses shared-media networks anymore. But basically all computers on a subnet could hear all packets on that subnet, as it was physically one solid wire, and as more nodes were added you would get more chance of collisions and congestion (a non-linear increase). So you chop it in two with a bridge (like a filter of sorts) so that each side sees only about half of the total traffic, because only packets addressed to the other subnet are passed through the bridge.
@Sandeep-cz7ls2 жыл бұрын
@@mytech6779 wait im still confused, how does this allow the banks to be accessed in parallel? is it due to the interconnect?
@valshaped2 жыл бұрын
@@Sandeep-cz7ls Each bank can be accessed by one CPU at a time. More banks -> more CPUs at a time.
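A toy simulation of that idea, for what it's worth. It assumes every CPU asks for a uniformly random bank each cycle and that a bank serves exactly one request per cycle. Real memory controllers queue and reorder requests, so treat this only as a picture of the trend.

```python
import random


def stall_rate(n_cpus, n_banks, cycles=20_000, seed=0):
    """Fraction of requests that lose out because their bank is already taken."""
    rng = random.Random(seed)
    stalled = 0
    for _ in range(cycles):
        busy = set()  # banks already claimed in this cycle
        for _ in range(n_cpus):
            bank = rng.randrange(n_banks)
            if bank in busy:
                stalled += 1
            else:
                busy.add(bank)
    return stalled / (cycles * n_cpus)


if __name__ == "__main__":
    for banks in (1, 2, 4):
        print(banks, "bank(s):", round(stall_rate(4, banks), 3))
```

With a single bank and 4 CPUs, exactly 3 of every 4 requests wait, so `stall_rate(4, 1)` is 0.75. Adding banks brings the rate down, though never to zero as long as CPUs can pick the same bank.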
@MaulikParmar2102 жыл бұрын
@@Sandeep-cz7ls To keep it simple: in modern-day CPUs, or let's say CPU clusters, there's a memory controller inside each CPU cluster that makes requests on behalf of the physical CPU die. But in NUMA there are multiple clusters, each acting on its own, so there are multiple access points through which different CPUs can reach different, or the same, memory banks. When two controllers try to access the same bank and location, that is a parallel access, and it causes a lot of data inconsistencies when different CPUs read and write at the same time, unless it is handled at the software level so that the software is aware of the architecture. The OS knows the memory space, and the kernel is generally responsible for making sure each CPU's request is translated in the proper order and to the proper physical location, using translation tables or other hardware means that speed this up, depending on what's available. In NUMA this is much more complex, as each node has to communicate and coordinate exactly what it needs. That's where the connecting fabric comes in: it provides the crucial functions for getting data in and out of foreign clusters. Keep in mind that when we talk about software here, it's mostly OS-level software and not consumer APIs, since consumer APIs abstract these traits away. Your software would never know, or have to care, whether it's running on 1 core, 4 cores, or 12 cores across 2 CPU sockets; in the eyes of userspace, resources are unified. If you want to optimise, then of course you can ask the system to allocate memory near a resource. It's the OS's job to maintain and abstract the hardware and allow controlled access via syscalls or driver APIs.
@as-qh1qq2 жыл бұрын
Why does making the interconnect (distributed shared memory) super-fast not bring back the original problem that we were trying to solve - increased memory access collision with increased CPUs? After all, if far away CPUs can access memory in nearly the same time as the nearby ones, how is it any different than just one memory with all near and far CPUs connected to it ?
@ssvis22 жыл бұрын
It probably would reintroduce the problem. However, I would suspect there is some trickery under the hood of the OS working with the hardware to optimize data locality to keep the data on the "near" memory for any core. It's possible that part of it is memory mapping in the data interconnect so that memory on the "far" chunk could still be viewed as local to a core, and the super fast interconnect effectively negates the performance penalty that a traditional NUMA system would have.
@samuie22 жыл бұрын
I agree that it was not super clear in the video. I think you could still have that issue. however, it happens half as often since you have 2 banks of memory.
@davidgillies6202 жыл бұрын
I would guess that it means you don't _have_ to tune data affinity (which makes development/deployment easier and therefore cheaper) but you _can_ if you want (which gives you the benefits of an optimised NUMA configuration).
@ssvis22 жыл бұрын
@@davidgillies620 I'm thinking the same thing. By optimizing specific parts of the system, Apple has theoretically designed something that will perform really well in 99% of use cases. There's always more performance to squeeze out, but with severely diminishing returns.
@gajbooks2 жыл бұрын
UltraFusion is really just a memory... Fusion. Their memory gets twice as fast since they have twice as many banks, they just need a way to combine the M1 chips so that both of them can use the other's memory at high speeds. There was probably some tradeoff with the memory controller or packaging which made them need 2x64 rather than having external 128 GB. I imagine their real Mac Pro replacement will have external memory and GPU.
@doctorpex68622 жыл бұрын
Netflix gains most of speed by "video is not available in your country"
@RegitYouTuber2 жыл бұрын
Favourite bit of this was the chaotic side-angle crash zoom - really complements the desperate addition of “well of course it’s more complex than this, but” that seems necessary these days
@BenjyP.2 жыл бұрын
I read ml instead of m1 so I thought this would be a video of how the neural cores work. I would love a video on how to use the apple neural cores for machine learning as they already take up 20% space of the entire chip
@TheMrKeksLp2 жыл бұрын
Modern CPUs are only Harvard architectures in the most pedantic classification. Instructions are still kept in main memory; they just have separate level 1 instruction and data caches. Even levels 2 and 3 are shared...
@jaopredoramires2 жыл бұрын
The camera and lighting on this one looks incredible
@KipIngram9 ай бұрын
It's worth noting that the PCI ports are usually also split into these two domains, so you want to take that into account as well.
@kuroexmachina2 жыл бұрын
this channel is gold. always has been
@sholinwright66212 жыл бұрын
Don’t you still have to write code to distribute the memory hits across the two memory banks, or else you just get the same multi-core stalling effect mentioned earlier? The speed-up was the ability to partition core memory fetches into two batches, preventing all of the cores stalling trying to fetch from the same bank. Side note: I work on a radar with 11 CPU cards, with an 88000 on each and 2 MB of local RAM, with the collection tied to 2 global memory cards with 8 MB of RAM. GRAM memory fetches are really expensive.
@petrilaakso79272 жыл бұрын
Excellent explanation of NUMA, excellent work🙏🏼
@JohnnyWednesday2 жыл бұрын
Thank you kindly Dr. Bagley for sharing your knowledge with us. I'm quite surprised that Intel and AMD have not yet pushed for on-die memory given the M1's impressive demonstration
@SimonVaIe2 жыл бұрын
It does have some negative consequences. More expensive to produce, not expandable, if one thing breaks the whole thing is broken. I also don't know how much expertise would be required in ram design/production (keep in mind that Apple is far bigger than intel, which is far bigger than AMD) seeing there is a very well established ecosystem of memory manufacturers (they do have quite extensive cache systems on their CPUs already, don't know how well that translates). And not every task profits as much from faster ram. No idea if those are major reasons for amd and intel, but like for everything else it's just a matter of finding what best fits a job.
@dotted13372 жыл бұрын
On-die RAM is rather limiting, so it won't really work well for either AMD or Intel to make such a product: that kind of RAM is much too slow, in terms of both bandwidth and latency, for use as a cache, and if used as RAM you'd have the same problem this video is talking about. But Intel had the i7-5775C back in 2015 with 128MB of eDRAM for the on-board GPU, which was also used as an L4 cache, and Intel's upcoming Sapphire Rapids Xeon will have a version with 64GB of on-package HBM2E with a bandwidth of well over 1TB per second. And finally you have AMD with their V-Cache, supposedly with a bandwidth of about 2TB per second. tl;dr Apple can do on-die memory because they know exactly who their customers are and can make almost tailor-made SoCs for them, whereas AMD and Intel have customers much too diverse to make on-die memory viable.
@JohnnyWednesday2 жыл бұрын
@@dotted1337 - Thank you for your detailed reply. I was unaware of the i7-5775C; that smells like it could have been designed for use in a console, given the perceived similarity to previous Xbox memory layouts. It is my understanding that a large part of the M1's 'boost' over other ARM designs is the lower-latency access to system memory? Perhaps naive, but if such performance can be gained for an ARM chip, then should not a similar ratio of performance be seen with a similarly designed x86 chip? With ultra-fast streaming devices and multi-channel paradigms like the PS5's SSD controller, could we not see a slowing of the growth in average memory capacity for users? Perhaps the time for a fixed 16GB of memory on a CPU is now? Especially given the console generations are locking game engine technology advancements for years at a time?
@harshpatel90202 жыл бұрын
I think this is because they use DDR in their desktop models (laptops come with both) and not LPDDR as used in Apple's M1 lineup. In mobile processors, where both DDR and LPDDR are used, the RAM is mounted on the PCB (soldered on the motherboard, not on the die itself as you said is the case with Apple). Note: many things I said may turn out to be wrong, so it's better to cross-check things before drawing any conclusion. I would be happy to know where I am wrong and learn something new. Thank you.
@mytech67792 жыл бұрын
On-die memory is called L1 cache; sometimes L2 and L3 are often placed on die as well. In fact over 80% of late-generation CPU silicon area is taken up by on-die memory. (NB4: yes, the 386 had off-die L1, but it was 1986)
@mysteriousm12 жыл бұрын
Was there an earthquake during filming or why is it so shaky?
@JJ-fq3dh2 жыл бұрын
Great video, brings back memories of coding on an SGI Origin 2000 and IRIX
@salmiakki56382 жыл бұрын
*It's only the first two generations of Threadripper CPUs that have 2 NUMA nodes. The last one and both generations of Threadripper Pro have unified memory access
@romevang2 жыл бұрын
Threadripper 2990WX has 4 NUMA nodes. 2950X I think has 2.
@salmiakki56382 жыл бұрын
@@romevang thanks, I thought I remembered it was the same throughout the range
@user-cc8kb2 жыл бұрын
Great explanation. Thanks!
@danielsilva1582 жыл бұрын
Would’ve been good to touch on how this memory system interfaces with the gpu!!
@user-cx2bk6pm2f2 жыл бұрын
Finally!! I understand NUMA.. thank you !
@grahmn8862 жыл бұрын
Lesson of the day, Thanks as always Steve :)
@shaneclk98542 жыл бұрын
Excellent video
@Sierra-Whisky2 жыл бұрын
What an excellent explanation! And what a coincidence too. I tried to explain NUMA and the potential performance hog on the exact same day this video was published but obviously my explanation was nowhere near as clear as this one. 🤣 Thanks! I'll share it with my colleagues.
@aipsong2 жыл бұрын
Excellent, instructive video - thanks!
@tomdchi122 жыл бұрын
Doesn't Apple provide the compilers (and IDE) so couldn't they be baking in the modifications to the code that is required to manage the non-uniformness of memory access times? (Regardless, early benchmarks indicate that performance is scaling only a little short of linearly with the number of cores, so we can infer that memory access across the two halves of the "fused" CPU isn't creating major delays.)
@OscarBerenguerPV2 жыл бұрын
This was a great video
@SproutyPottedPlant2 жыл бұрын
That was great! When you showed the bus arbiter it reminded me of the Sega Mega Drive! It’s got one of those??
@SaiPhaniRam2 жыл бұрын
Excellent presentation .. Simple and easy to understand 👏
@vernonthomas65542 жыл бұрын
Love your channel.
@Derbauer2 жыл бұрын
Nicely explained!
@IceMetalPunk2 жыл бұрын
Apple: "M1 ULTRA FUSION!" Reality: "It's a fast wire junction."
@G5rry2 жыл бұрын
Reality: No, it's a bit more than that.
@RunForPeace-hk1cu2 жыл бұрын
If it’s so easy everyone would make 10TB/s interconnect 😂 It’s a lot more complex than that.
@giornikitop53732 жыл бұрын
@@RunForPeace-hk1cu it IS actually fairly straightforward to make a 10TB/s interconnect, but the cost is beyond crazy. Besides, you need a CPU of such power to take advantage of it, so the cost makes even less sense. So the reason is not that they cannot make it; the reason is they don't need to, at least not yet.
@jfmezei2 жыл бұрын
Great to find someone who remembers NUMA!! BTW, you forgot to deal with cache coherence: core 1 modifying contents at a memory location that is also in core 2's cache. In the 1990s, Digital tried to scale its Alpha computers to have many cores with its Wildfire class machines. They found that 4 cores was the max the memory controller could handle before performance increments stopped being interesting. So they created the Wildfires with 4-CPU "QBBs", which were boards connected by what Digital called a switch. NUMA access between these QBBs was atrocious. This was dealt with at the operating system level, less so at the application level. You could pre-load shareable images onto a specific QBB and then launch processes that use them on that QBB, so they would use local memory for shareable images etc. But this was nowhere near enough. Digital then worked on the next-generation Alpha, the EV7, which was delayed as long as they could because Compaq/HP, who had bought Digital, didn't want the EV7 to beat the pants off the Intel Itanium heat generator. The EV7 introduced a totally new memory controller that remained state of the art beyond the death of Alpha. HP donated Alpha IP to Intel, which used it for its CSI interconnect (later called QuickPath) and which evolved from there. Ex-Alpha engineers went to AMD, who developed their own version, and many ex-Alpha engineers formed P.A. Semi, which was purchased by Apple to create its own ARM chips. The EV7 had coherent cache (and I believe only IBM's Power had this until AMD matched it). Intel's QuickPath did not implement coherent cache initially (despite having all the IP from DEC). If you google for Alpha Wildfire NUMA, you will find a result "Optimizing for Performance on Alpha Systems - Semantic..." by Norm Lastovica. It provides some then-current memory access figures showing the differences between direct and NUMA accesses in the Wildfires. But page 26 also shows the EV7 memory architecture as a fabric.
(21364 is the EV7 CPU; the first generation was the 21064.) Each CPU controlled a part of RAM. But because CPU 1 could request memory from CPU 2 at the same time as CPU 3 requested from CPU 4, CPU 5 from CPU 6, etc., it ended up having a huge performance advantage when scaling the number of cores. There was also an issue of CPU speed vs memory speed. Alpha came to surpass memory speed easily, hence the 4-core limit Digital found in the 1990s. But when you increase memory speed (and it has increased tremendously since then), it lets you increase the number of cores that have direct access, especially in the last little while, when "Moore's Law" was more about adding cores than making each core faster. Before their death, Digital engineers would present at DECUS conferences and provide much information about Alpha advancements and how they improved things etc. It is a real shame that Apple hides all the real information and only provides marketing gobbledygook that is useless.
@andybaldman2 жыл бұрын
Nobody cares, man.
@RogerBarraud2 жыл бұрын
@@andybaldman You are wrong on the Internet.
@andybaldman2 жыл бұрын
@@RogerBarraud Nope you are
@acanalesc2 жыл бұрын
12:55 is that what is known as "Numa Aware"?
@genhen2 жыл бұрын
yes it is
@newburypi2 жыл бұрын
Think I missed something here. Totally got the "was slow but Apple made it fast." However I think there's a promise of "won't need to change the software." The NUMA method requires knowledge of which memory block has the desired data. Hence, a change to software. So... did they also build a way to hide the fact of two memory blocks?
@elliott81752 жыл бұрын
The reason NUMA systems usually require the software developers to be aware of the positioning of CPUs and memory is because of the slower speeds when fetching data from memory that is farther away. However, the new M1 chip claims to make fetching data fast enough for the worst-case RAM position to still not cause any slow-down. I assume this means that the difference in time to fetch memory that is close, compared to memory that is far away, is less than a clock cycle. So from the core's point-of-view they have the same latency.
@newburypi2 жыл бұрын
@@elliott8175 great. Thanks for the clarification. Thought I missed something.
@RunForPeace-hk1cu2 жыл бұрын
@@elliott8175 the “trick” is literally the hardest part that no one could solve 😂
@Benny-tb3ci2 жыл бұрын
We, the people in chemistry and any other science that relies heavily on chemistry, have a very nice phrase for these kinds of things. It's called the "rate-limiting step" (in a chain of reactions).
@ipurelike2 жыл бұрын
thanks for the technical explanation!
@dembro272 жыл бұрын
Cool stuff. But now I have "Numa Numa" in my head...
@kanishk94902 жыл бұрын
Yeah me too.
@qwertypnk94012 жыл бұрын
Nice, good job!
@Ojisan6422 жыл бұрын
Was this filmed on board a ship at sea?
2 жыл бұрын
😭😭
@AL-vc9xc2 жыл бұрын
Wow, very well and simply explained. Not in a math profession, but I did understand this quite well! Thank you!!
@JCBOOMog2 жыл бұрын
Hi steve
@X_Baron2 жыл бұрын
Ultra Fusion is basically Blast Processing, but more extreme and rad.
@wile1234562 жыл бұрын
Maybe you've done it before but I would love a video explaining video games vs rendering/productivity workloads. Games get a big performance boost with more cache, the 5800X3D 8 core cpu increased performance a lot from over doubling level 3 cache with 3D stacking. But why does it mostly only benefit games and not other workloads?
@bosco45332 жыл бұрын
I love this channel. /message.
@Xiaomi_Global2 жыл бұрын
How about the same architecture but different fab interconnect process? Does it affect performance?
@1idd0kun2 жыл бұрын
No matter how fast the interconnect is, it's never gonna behave like a UMA system. If a core in die 1 tries to access the memory pool attached to die 2, there will be a latency penalty. We won't know how big that latency penalty is and how much of an impact in performance will have until the system is properly tested. I'm hoping Anandtech will test it since they usually do memory latency tests.
@bobo-cc1xw2 жыл бұрын
Ian Cutress, formerly of AnandTech, said above: 5 to 7 ns for just the interconnect vs 54 ns total. So call it 15 percent more latency
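The percentage is easy to check. A quick sketch, taking the 54 ns baseline and the extra 5 to 8 ns quoted earlier in the thread at face value (the function name is just mine):

```python
def added_latency_percent(base_ns, extra_ns):
    """Extra latency expressed as a percentage of the baseline latency."""
    return 100.0 * extra_ns / base_ns


if __name__ == "__main__":
    for extra_ns in (5, 7, 8):
        pct = added_latency_percent(54, extra_ns)
        print(f"+{extra_ns} ns on 54 ns is {pct:.1f}% more")
```

So "15 percent" is about right for the top of the quoted range, and closer to 9 percent at the bottom of it.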
@SimonJentzschX72 жыл бұрын
Great video. I learned something new! Just one question: could the operating system optimize my code when executing? When I allocate memory, the OS should know which CPU the process is running on and allocate the memory in the RAM that is faster to access. This way the code does not need to change, just the OS.
@mr_waffles_the_dog2 жыл бұрын
OS's already tend to do this :D The problem is what happens when you have multithreaded code (e.g. running on multiple cores/cpus at once), there is no one ideal block of memory for the OS to allocate to. The Apple claim is that their system is non-NUMA, or at least sufficiently fast to be indistinguishable, so developers don't have to rearchitect things to maximize performance.
@nameunknown0072 жыл бұрын
Love you man!
@marklonergan38982 жыл бұрын
Maybe I'm not understanding the problem correctly, but couldn't you just have a rudimentary controller sitting between the 2 that uses the most-significant bit of the address to determine which RAM chip has the data? That way, by having the controller between the chips as the central access point, all queries would take the same amount of time to fetch the data. By having this logic at hardware level you would have minimal latency added. I know this would only work on chips that are the same size, but you could combine composites with singles (i.e. 2x 32s connected with a controller could be combined with an actual 64 with a controller)
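Roughly what that decode step would look like, as a sketch. The 32-bit address width, the 64-byte line size and both function names are assumptions for illustration. The second function shows the low-order interleaving alternative mentioned further up the thread, where consecutive chunks alternate between the two banks.

```python
def split_by_top_bit(addr, addr_bits=32):
    """Top address bit picks the bank; the remaining bits are the offset in it."""
    bank = addr >> (addr_bits - 1)
    offset = addr & ((1 << (addr_bits - 1)) - 1)
    return bank, offset


def split_interleaved(addr, line_bytes=64):
    """Alternate banks every line_bytes chunk (low-order interleaving)."""
    line = addr // line_bytes
    bank = line % 2
    offset = (line // 2) * line_bytes + addr % line_bytes
    return bank, offset


if __name__ == "__main__":
    for addr in (0x00000010, 0x80000010):
        print(hex(addr), "->", split_by_top_bit(addr))
    for addr in (0, 64, 128):
        print(addr, "->", split_interleaved(addr))
```

The decode itself is cheap either way. It only picks the bank and says nothing about how far away that bank physically is.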
@Addlibs2 жыл бұрын
This suffers the same slowdown that results from physically separate RAM locations, close to individual groups of CPU cores but not as close to others. Even if the most significant bits picked the RAM module without any fancy chips in the way, fetching data from a CPU farther down the line is going to be generally slower, and it's easy to double or triple the tiny amount of time it takes to fetch data with computers this compact and fast. That is, 4 nanoseconds is twice as long as 2 nanoseconds; both are incredibly fast though.
@katbryce2 жыл бұрын
@@Addlibs Remember that a 4GHz CPU ticks through 4 clock cycles every nanosecond, and in a nanosecond, light travels about 30cm. Electricity is slower, so any round trip of more than about 3cm isn't going to happen within a clock cycle.
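Back-of-envelope version of that, as a sketch. The velocity factor of 0.5 (signals moving at about half the speed of light) is an assumed ballpark, not a measured figure for any particular chip or board.

```python
C_CM_PER_NS = 29.9792458  # speed of light in vacuum, in cm per nanosecond


def cycle_time_ns(clock_ghz):
    """Length of one clock cycle in nanoseconds."""
    return 1.0 / clock_ghz


def path_per_cycle_cm(clock_ghz, velocity_factor=0.5):
    """Total path length (out and back combined) a signal covers in one cycle."""
    return C_CM_PER_NS * velocity_factor * cycle_time_ns(clock_ghz)


if __name__ == "__main__":
    print("cycle at 4 GHz:", cycle_time_ns(4.0), "ns")
    print("light in vacuum per cycle:", round(path_per_cycle_cm(4.0, 1.0), 2), "cm")
    print("at half light speed:", round(path_per_cycle_cm(4.0), 2), "cm")
```

At 4 GHz a cycle is 0.25 ns, so even light in a vacuum only covers about 7.5 cm per cycle, and a slower electrical signal lands in the few-centimetre range.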
@yashkumarsingh97132 жыл бұрын
10:39 Why does the CPU of one RAM need to go to the other?
@circuitgamer77592 жыл бұрын
Video idea (because I don't know where to look for this) - some of the finer details of caching implementation. I understand the idea behind caching, and the structure behind it, but not how it's actually implemented. I want to learn the actual control logic for reading/writing cache lines, and when and how it gets updated to/from RAM or a higher level cache. Do the CPU cores control the caches directly, or is there some control logic for each cache that isn't a part of a specific core? I think it would be an interesting video, but if there's already one that exists that I missed, can someone reply with a link? I've only been able to find high-level explanations so far.
@henrikjensen32782 жыл бұрын
Good explanation, but I would like some explanation about write/read, i.e. two threads reading and writing to the same memory location. This would be easy enough to handle between the two sides, but what two cpus on the same side with their own cache, it sounds like a lot of circuit to handle that. Are there some smart solutions?
@ClarkCox2 жыл бұрын
That is indeed a problem that must be contended with. Look up "cache coherence"
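For a feel of the simplest answer, here is a toy write-invalidate scheme. It is a sketch under heavy assumptions: one shared bus, write-through to memory, and one value per address instead of real cache lines. Actual protocols (MESI and its relatives) track more states than this, and the class names here are made up.

```python
class Bus:
    """Shared bus that owns main memory and knows every attached cache."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def invalidate(self, addr, except_cache):
        # Tell every other cache to drop its copy of this address.
        for cache in self.caches:
            if cache is not except_cache:
                cache.lines.pop(addr, None)


class Cache:
    """Private per-core cache attached to the shared bus."""

    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:  # miss: fetch from main memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.invalidate(addr, except_cache=self)
        self.lines[addr] = value
        self.bus.memory[addr] = value  # write-through keeps memory current


if __name__ == "__main__":
    bus = Bus()
    core_a, core_b = Cache(bus), Cache(bus)
    core_a.write(0x40, 1)
    print(core_b.read(0x40))  # core B misses and fetches the value from memory
    core_a.write(0x40, 2)     # core B's stale copy is invalidated here
    print(core_b.read(0x40))
```

The cost is visible even in the toy: every write has to be broadcast to the other caches, which is the extra circuitry and traffic the question is worried about.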
@gorunmain2 жыл бұрын
This is great!
@RAJATTHEPAGAL2 жыл бұрын
Another hypothesis is Apple's Rosetta layer, possibly working to translate instructions to accommodate the memory layout. Perhaps it taps in between the OS kernel-level calls and the application layer to translate the memory allocation and instruction placement so they are co-located in the same memory. I mean, Rosetta emulation is fast; I wouldn't be surprised if they use it for this purpose. It won't be a silver bullet, but a bullet they may add for solving the memory placement issue.
@magicmark33092 жыл бұрын
I wouldn't think so. Rosetta only installs once you install software that can’t natively run on M1. I think that’d be adding too much overhead to an already somewhat costly translation layer. Although I’ve seen it really depends on the particular software. It also helps that Apple has a very large piggy bank for their R&D and that they plan everything so far off. Hence why iPhones are just now getting high refresh rates. Hopefully this will give new life to competition in the market.
@vladomaimun2 жыл бұрын
Does application software need to be NUMA-aware, or does the OS kernel handle everything NUMA-related?
@JamesClarkUK2 жыл бұрын
The OS could do scheduling to keep your application on one NUMA node. You can use numactl on Linux to tell the kernel what you want to happen
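On the command line that is something like `numactl --cpunodebind=0 --membind=0 ./app`. A rough in-process equivalent on Linux is sketched below. It only pins the CPUs and leans on the kernel's usual habit of allocating memory local to the CPU that first touches it, which is my assumption about the default policy on your system. The helper names are made up; the sysfs path and `os.sched_setaffinity` are real but Linux-only.

```python
import os


def parse_cpulist(text):
    """Turn a kernel cpulist string such as '0-3,8-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def cpus_of_node(node):
    """CPUs of one NUMA node as listed in Linux sysfs; [] if not available."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    try:
        with open(path) as handle:
            return parse_cpulist(handle.read())
    except OSError:
        return []


def pin_to_node(node):
    """Restrict this process to one node's CPUs. Returns True if it worked."""
    cpus = cpus_of_node(node)
    if not cpus or not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    except OSError:
        return False
    return True


if __name__ == "__main__":
    print("node 0 CPUs:", cpus_of_node(0))
    print("pinned:", pin_to_node(0))
```

On a machine without NUMA information in sysfs (or on a non-Linux OS) both helpers simply report nothing to do.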
@RunForPeace-hk1cu2 жыл бұрын
The whole point is it’s a HW implementation and no software needs to be changed.
@TheGTP19952 жыл бұрын
To be fair, most programmers wouldn't have had to care about how the memory worked anyway because the compiler would have done the job for them. This spared Apple the cost of writing new optimization code for the compiler, at the cost of more hardware engineering
@RunForPeace-hk1cu2 жыл бұрын
You’ve never written kernel drivers have you? 😂
@K4nj2 жыл бұрын
@@RunForPeace-hk1cu he clearly hasn't worked within a c++ development environment
@tcornell052 жыл бұрын
This might be the most informative video I've come across in years on YouTube. You have an amazing way of articulating topics like this to the ADHD & dyslexic programming community, like myself xD. Now I'm dying for a follow-up on how exactly they managed to make the distributed shared memory link so fast. Any resources you recommend?
@michaellatta2 жыл бұрын
I would guess ram attached to each die and cache is on that die. Interconnect used for off-die access to the other die’s cache/ram.
@jbf81tb2 жыл бұрын
I would like to know what the architecture of the GPU is like after seeing this. I believe GPUs have thousands of cores, but I don't think I see numbers for cache in their marketing materials, just memory, cores, and clock speeds.
@TDRinfinity2 жыл бұрын
I don't know about GPUs with 100% certainty, but I design SOCs with hundreds of cores and we still have individual caches for each core, as well as shared cluster caches
@Edekje2 жыл бұрын
It's quite interesting how GPUs work actually. They function in a completely different way to CPUs, preventing this problem. Each core in a multi-core CPU system functions as a completely independent entity, accessing whatever memory it needs to complete its calculations. Different cores working on separate tasks can therefore end up accessing the same memory, thereby slowing down each other's progress. The gist of what happens in a GPU is that the entire GPU is dedicated to doing just one single task. Each one of its cores executes exactly the same task, piece of code, in lockstep. The key difference here is that each core performs the same actions, but on different pieces of memory. Often these pieces of memory are adjacent. So a (sensible) GPU program will never have different cores trying to lock the same piece of memory simultaneously. The GPU has chopped up one big task into 1000s of equal bite-sized chunks. Hope that explanation helps!
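A very loose sketch of that model in plain Python, where an ordinary loop stands in for thousands of hardware lanes running in lockstep. `launch` and `saxpy` are illustrative names, not any real GPU API.

```python
def launch(kernel, n_threads, *args):
    """Run the same kernel once per thread index (sequential stand-in for lockstep)."""
    for tid in range(n_threads):
        kernel(tid, *args)


def saxpy(tid, a, x, y, out):
    """Each 'thread' touches only its own element: out[tid] = a * x[tid] + y[tid]."""
    out[tid] = a * x[tid] + y[tid]


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    y = [10, 20, 30, 40]
    out = [0] * len(x)
    launch(saxpy, len(x), 2, x, y, out)
    print(out)  # [12, 24, 36, 48]
```

Every thread runs identical code and the thread index decides which piece of memory it works on, so no two threads ever want the same element.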
@TDRinfinity2 жыл бұрын
@@Edekje so like vector/data level parallelism vs thread level parallelism?
@TDRinfinity2 жыл бұрын
@@Edekje like is a GPU usually just executing a single thread of vector instructions at a time, or can it split across clusters of execution units to execute different threads? I have really no GPU experience
@jbf81tb2 жыл бұрын
@@Edekje Thank you, that is a helpful explanation. Is there some synchronizer that keeps the cores on task? I'm thinking like Tom, I know the GPU is basically a matrix multiplication fiend. Is that because it can easily distribute all the multiplications to a bunch of cores and then there's a synchronizer that can grab all those products and sum them together in the appropriate way to return the expected matrix? Rereading your answer, I'm wondering if it's like a pass-the-bucket situation. Like a core takes a bucket of memory from the core "to its left", it does some operation on that memory, and then hands it off to the core "on its right", looking left for its next bucket. Or is it more like "i've got 10 buckets and I need 10 cores to work on them", hands those off, and then next "I've got 37 buckets," hands them off, etc.
@johongo2 жыл бұрын
I want to learn more about this stuff, but it seems very distant, even as someone who programs for work. Any advice?
@MrPBJTIME122 жыл бұрын
Computer Organization & Architecture - William Stallings
@sevilnatas2 жыл бұрын
Does Computerphile often use greenbar paper for their illustrations, because they are still using greenbar a lot, so it is handy, or is it because they don't use it anymore, so they have a bunch of it sitting around, unused, so they might as well use it for illustrations?
@itsMunchkin2 жыл бұрын
Waw! Powerfully explained.
@iammakimadog2 жыл бұрын
Thank you!
@genhen2 жыл бұрын
I've always wondered if we accessed more than one NUMA nodes worth of memory, how does the memory get chunked up? Take half and half? Take most from one? Is it hardware dependent? Software/OS dependent?
@katbryce2 жыл бұрын
On my Threadripper motherboard, there is the CPU, and either side of it, there are four memory slots for a total of 8. The 4 slots on one side are one NUMA node, and the 4 slots on the other side are the other NUMA node.
@PoseidonDiver2 жыл бұрын
Also, there is no true virtual to physical CPU affinity. And the hypervisor generally allocates the compute to the VM as needed, when running performance graphs you can see big spike across the sharing CPUs when its allocating compute from another node. (hope that actually answers your question :p )
@peterhindes562 жыл бұрын
Why have a memory interconnect at all then? Unless this was not intended to solve the problem mentioned about memory access getting clogged up.
@bumbixp2 жыл бұрын
Doesn't the OS scheduler largely handle this? Even if you make a single threaded app, Windows will move it around on different cores but it stays within the same NUMA node.
@Pyroblaster12 жыл бұрын
Let's say you allocate a buffer and load data into memory in a single thread and then start many threads to process that data, which is a perfectly reasonable and usual way to do things with uniform memory access. Then if you saturate the system with threads, half or more of the threads will run on NUMA nodes that are different from where the data buffer was allocated, incurring the longer access times. You have to explicitly handle the allocation and data loading so that the data is distributed in a way that the threads processing each part of the data are on the same NUMA node as the data they are processing.
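To put rough numbers on that, here is a toy sketch in Python. It is only a model, not a real allocator: it assumes a two-node machine, eight equal chunks of data, and "first touch" placement (a chunk lives on the node of the thread that loaded it). The names `count_remote_accesses`, `chunk_home`, and `worker_node` are made up for illustration.

```python
def count_remote_accesses(chunk_home, worker_node):
    """Count chunks whose data lives on a different node than the thread processing it.

    chunk_home[i]  -- node that holds chunk i (assumed: the node of the thread that loaded it)
    worker_node[i] -- node of the thread that processes chunk i
    """
    return sum(1 for home, node in zip(chunk_home, worker_node) if home != node)


NUM_NODES = 2    # assumed two-socket (or two-die) machine
NUM_CHUNKS = 8   # assumed: one chunk per worker thread

# Worker threads spread evenly over the nodes: [0, 0, 0, 0, 1, 1, 1, 1]
worker_node = [i * NUM_NODES // NUM_CHUNKS for i in range(NUM_CHUNKS)]

# Case 1: a single thread on node 0 loads everything, so every chunk lives on node 0.
loaded_by_one_thread = [0] * NUM_CHUNKS

# Case 2: each worker loads its own chunk, so the chunk lives on that worker's node.
loaded_by_each_worker = list(worker_node)

naive_remote = count_remote_accesses(loaded_by_one_thread, worker_node)
local_remote = count_remote_accesses(loaded_by_each_worker, worker_node)

print("remote chunks, single-thread load:", naive_remote)
print("remote chunks, per-worker load:   ", local_remote)
```

In this model, half of the chunks end up remote when one thread does the loading, and none do when each worker loads its own part, which is the "half or more" effect described above.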
@Piktogrammdd12342 жыл бұрын
Yes and no. There are mitigations on every level to compensate for problems, but every solution is just bad in comparison to an idealized system with endless memory, zero latency, and no collisions. OS schedulers try to localize data and corresponding processes, but limits are still there. It will be problematic every time processes on a node need more memory than is available locally, or processes are relocated to other nodes.
@ivanskyttejrgensen74642 жыл бұрын
The OS tries to handle this, but it's not perfect. E.g. last time I dealt with this, the OS tried to serve memory allocations from the nearest memory, but wouldn't move it around afterwards. So we ended up using processor sets to direct processes to be started at the "right" part of the CPUs so the subsequent memory allocations could all be served from the local memory. That gave a 10-15% speedup compared to leaving it to the OS to figure things out.
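A rough Linux analogue of that approach, as a sketch only: the comment above does not say which OS or tool was used, so this swaps in CPU affinity via Python's `os.sched_setaffinity` and reads the node's CPU list from sysfs. The helper names `parse_cpulist` and `pin_to_node` are hypothetical, and `pin_to_node` only works on a Linux system that exposes NUMA nodes under `/sys/devices/system/node/`.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

The idea would be to call something like `pin_to_node(0)` early, before the big allocations happen, so that memory served "from the nearest memory" ends up next to the CPUs the process is allowed to run on.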
@asmerhamidali96792 жыл бұрын
Please make some videos on RISC-V. Lately it has been a hot topic.
@scratchpad79542 жыл бұрын
I literally couldn't resist reading the title as Apple M1 Ultra and Numa Numa Yee as my exhausted brain saw the Dragostea Din Tei reference!
@bentationfunkiloglio2 жыл бұрын
Quite informative.
@5urg3x2 жыл бұрын
I very clearly remember the days of dual socket (like multiple physical CPUs with their own memory) workstations. It looked cool on paper, but in the real world, it usually didn't work out very well. Many times, even with software optimizations, it was more efficient (and simpler logistically) to just use one physical processor, rather than to attempt to have them both working together on the same task or set of tasks, and having to swap data in and out of cache and memory, etc. For servers, it could work, but most workstation workloads just aren't going to benefit from that type of an architecture.
@radutopor83892 жыл бұрын
I still don't get why splitting the RAM in two wouldn't cause the same collisions problem with the high number of CPUs, given they effectively still share just one bus, albeit connected by some black box in the middle.
@YeOldeTraveller2 жыл бұрын
Because the two NUMA regions are separate, any access in one region does not impact access in another region. Even without coding for it, you reduce the likelihood of collision.
@kirtanmusica19992 жыл бұрын
Namaskar, thank you for the education, thank you for the light
@kelvinluk91212 жыл бұрын
is it possible to address the ram access conflict issue between different cpus by introducing more memory channels?
@steve1978ger2 жыл бұрын
If I were to guess how they did this, I'd say they've made their memory bus expandable in the first place, like having an extra bit on the address bus etc.
@PeterHarket2 жыл бұрын
Lovely content, but I have to say that the wobbly camera is quite annoying :/
@debojitmandal86702 жыл бұрын
Wait, but Apple isn't using distributed shared memory like you mentioned. Rather, a CPU in one block can access the memory of the other CPU block directly, without going through the distributed shared memory lane, at least that's what I understood from their presentation. There is no middleman like the shared distributed memory lane you mentioned. Please correct me if I am wrong.
@whatthefunction91402 жыл бұрын
Numa numa guy was way ahead of his time
@BR-lx7py2 жыл бұрын
Can the operating system take care of always allocating memory from the block that is closer to where the process that is requesting it is running? I know it's not perfect, but would work 90% of the time
@moritzhedtke81392 жыл бұрын
Linux actually does as far as I know
@ssvis22 жыл бұрын
To a certain extent, yes, the OS can. However, to do that effectively, it needs some information about memory requirements and usage patterns for a process. Some can be gleaned from the raw byte code, especially if there are hints placed by the programmers, but a lot will come from actually running the process, then dynamically remapping and moving memory as needed. It'll work better for long-running processes, but is by no means optimal. That's why most super high performance programs, such as many video games, manually set CPU core affinity and utilize custom memory allocators to provide direct control over memory locality. They'll even go as far as detecting which are fast and slow cores and prioritize from there.
@shunyaatma2 жыл бұрын
Yes, the Linux kernel can take care of this. The default memory policy (MPOL_DEFAULT) makes the page allocator always try to allocate memory from the local node but if that's not possible, it uses a different node. Over time, even if pages get scattered across NUMA nodes, Automatic NUMA Balancing will either try to move the pages to the node from where they were accessed the most or try to move the program itself to run on a CPU that is close to the memory that it accesses the most.
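The "local node first, otherwise another node" behaviour can be sketched as a toy model in Python. This is not kernel code: `allocate_page` and the `free_pages` list are invented for illustration, and the fallback simply picks the lowest-numbered node with free pages, which is a simplification of whatever order the real page allocator uses.

```python
def allocate_page(free_pages, local_node):
    """Hand out one page, preferring the caller's local node.

    free_pages -- list where free_pages[n] is the number of free pages on node n
    local_node -- node of the CPU that is asking for memory
    Returns the node the page was taken from.
    """
    if free_pages[local_node] > 0:
        free_pages[local_node] -= 1
        return local_node
    # Local node is full: fall back to any other node that still has room.
    for node, free in enumerate(free_pages):
        if free > 0:
            free_pages[node] -= 1
            return node
    raise MemoryError("no free pages on any node")


free = [2, 3]  # assumed: node 0 has 2 free pages, node 1 has 3
placements = [allocate_page(free, local_node=0) for _ in range(4)]
print(placements)  # the first two pages come from node 0, the rest spill to node 1
```

The spilled pages are the ones that a mechanism like Automatic NUMA Balancing would later try to fix up, either by migrating the pages or by moving the task.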
@l.matthewblancett80312 жыл бұрын
WHERE DID YOU FIND THAT 1972 printer paper??!?!! lol.
@DalasYoo2 жыл бұрын
CCIX Hooray!
@MXDMND_2 жыл бұрын
Thank you
@jaffarbh2 жыл бұрын
Things get more complicated as we use cloud based virtual machines and we have no idea (often) about the underlying hardware and architecture. If I recall, VMware hypervisor dynamically reallocates memory blocks to optimise for more (localised) access so that software developers don't need to worry about it.
@MrMiryks2 жыл бұрын
Why is the camera so wobbly and shaky? It is very irritating, and it makes me look away from the screen most of the time and just listen to the video like a podcast.
@autohmae2 жыл бұрын
I wonder if Linux scheduler already has a variable for the latency, so no new code is needed. My guess would be yes.
@roryskyee2 жыл бұрын
very technical
@kriptofinans28642 жыл бұрын
Very clear thx :)
@fernandoblazin2 жыл бұрын
when is the last time i saw that type of paper
@theobrominator2 жыл бұрын
My naive understanding of cpu architecture leads me to believe that the core to core memory interconnect is the lesser of the problem vs the GPU core kernel/instruction execution. Do you have any insight into that?
@bmitch30202 жыл бұрын
Is this at least part of the reason motherboard instructions specify which slots should be used for various numbers of RAM chips?
@R3BootYourMind2 жыл бұрын
No, the slots are numbered because of how RAM is accessed in parallel. Dual-channel memory is usually the maximum consumer CPUs can handle, and the two channels are electrically wired to work best with certain slots. Using the "wrong" slots would either make two RAM sticks work in single-channel mode, or in dual-channel mode but over the longer traces. The non-preferred slots are the ones with slightly longer traces; they are used when 4 sticks are installed, and they can affect memory overclocking results.
@pierreabbat61572 жыл бұрын
How do the CPUs handle it when two separate CPU chips, each with a cache, try to *write* to the same location? This can happen if the location is a mutex.
@katbryce2 жыл бұрын
This shouldn't happen. It does though, very frequently, and is the cause of most security vulnerabilities.
@edmondhung60972 жыл бұрын
But what is more important in this NUMA case: latency or bandwidth? And to push performance to the absolute limit, is it still better to use local memory instead of remote, even though Apple says the interconnect has more bandwidth than the memory itself?
@hoagy_ytfc2 жыл бұрын
Given what they claim for "unified memory", the GPUs should be factored into this description, IMO
@TheOisannNetwork2 жыл бұрын
Nice lights 😉
@soundcheck68852 жыл бұрын
If you have a massive high-speed low-latency die-to-die interconnect between two dies, accessing the memory on the other die could have a latency penalty of only 10-20ns. In a system using multi-level cache hierarchy, having 10-20% extra latency for remote memory access is irrelevant for most applications. What would be more interesting is if Apple can extend the same interconnect architecture to higher numbers of CPU clusters (e.g. 4 or 8) and how much extra latency would be involved for memory access in those cases. One interesting option to achieve this may be stacking dies vertically in addition to the planar configuration in M1 Ultra.
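A back-of-the-envelope version of that argument, sketched in Python. All the numbers are assumptions picked for illustration (95% cache hit rate, 5 ns cache, 100 ns local memory, 15 ns die-to-die penalty); none of them are measured M1 Ultra figures, and `average_access_ns` is just the usual weighted-average formula.

```python
def average_access_ns(hit_rate, cache_ns, memory_ns):
    """Average access time: cache hits cost cache_ns, misses cost memory_ns."""
    return hit_rate * cache_ns + (1 - hit_rate) * memory_ns


HIT_RATE = 0.95        # assumed fraction of accesses served by the caches
CACHE_NS = 5.0         # assumed average cache latency
LOCAL_NS = 100.0       # assumed local memory latency
REMOTE_PENALTY = 15.0  # assumed extra die-to-die latency (middle of the 10-20 ns range)

local_avg = average_access_ns(HIT_RATE, CACHE_NS, LOCAL_NS)
remote_avg = average_access_ns(HIT_RATE, CACHE_NS, LOCAL_NS + REMOTE_PENALTY)
slowdown = remote_avg / local_avg - 1

print(f"local average:  {local_avg:.2f} ns")
print(f"remote average: {remote_avg:.2f} ns")
print(f"slowdown:       {slowdown:.1%}")
```

With these assumed numbers, even if every cache miss went to the remote die, the average access time would only grow by roughly 8%, and a workload with a higher hit rate would see less. A workload that misses the cache most of the time would feel closer to the full 15%.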
@MattyHild2 жыл бұрын
Interesting video, but I’m afraid it doesn’t fully touch on how NUMA solves the bus contention issue from UMA type systems. Especially with the implication that you don’t need to program appropriately for a NUMA system in the M1 ultra. If you program agnostically to the non uniform memory, you effectively have a UMA system again. I get adding a second RAM bank boosts bandwidth but why not have a HBM style memory instead, especially stacked die HBM
@noenken2 жыл бұрын
Because of cost.
@torb-no2 жыл бұрын
In the Fujitsu A64FX is the CMG (Core Memory Group) like one of these groups talked about in the video? So if you’re on one node in one of them, trying to get data from memory connected to another CMG will be slower?
@dafoex2 жыл бұрын
All this is to say that people who think like the demoscene will write their programmes as if the M1 Ultra's distributed shared memory system was slow, just to squeeze out a little more speed