I love these talks that take C++ closer to the hardware, it's far too easy to forget about memory
@skilz8098 2 months ago
More than just memory. You have the pipeline mechanics, the branch predictor, the kernel's page size and page tables with virtual allocations and memory mappings, the various interrupts, and the stack and heap mechanisms within memory. Some things are architecture (ISA/ABI) dependent, others are kernel (OS) dependent, some are language and compiler dependent, and some are application specific.

When you design or build a 3D graphics renderer or engine from scratch using APIs such as DirectX, OpenGL, or Vulkan and their shader models, the bridge between hardware and software can and will make or break the efficiency and performance of your application. You can write an OpenGL or Vulkan application in a generic sense, but it's not going to behave exactly the same on an Intel CPU as on an AMD, ARM, MIPS, or RISC-V CPU. That doesn't even account for the various motherboard vendors out there with their own implementations of the bus interfaces and the memory-management microprocessors embedded on the board, separate from the CPU: an ASUS board is going to work similarly to, but differently from, an MSI or a Gigabyte board. We haven't even touched on GPUs and PCI Express lanes, nor on software dependencies across operating systems and compilers. Code isn't going to behave the same on Windows as on Linux or Mac, and MSVC will translate your source differently than GCC, Clang, Intel, and other compilers will; these compilers are typically within the context of the C family of languages. When you start to work with other languages and their compilers, interpreters, and binaries, such as Fortran, COBOL, Haskell, and many more, it's a different ball game.

Software on its own does absolutely nothing without hardware to operate it, and hardware does absolutely nothing without software to instruct it. They are two different things, but they are codependently interlinked. If you don't know the hardware you are trying to program or instruct, how do you expect it to perform the operations you want it to do? A simple example: take a TI-83 graphing calculator. It has a bunch of functions to evaluate and perform various mathematical operations. It expects information in a specific format; otherwise it will throw an error, because it doesn't know how to handle it. It's in a "bad state" according to its internal design, implementation, and specifications. The calculator even has its own interface for writing applications in its internal language.

Throughout the late '90s and early 2000s, the widely taught consensus was to forget about the hardware, just accept that it works, and replace it with layers upon layers of abstraction. Yet the entire time, that software still depends on the hardware, and the hardware still depends on the software. I've always seen this as a bad way of teaching it. Don't get me wrong, abstractions themselves aren't bad. We can abstract recurring patterns within assembly instructions and from them derive the mechanics of higher-level languages such as C: parameters and parameter passing between functions, methods, or routines; loops such as for and while loops; and conditionals such as if, if-else, and switch statements. So on the one hand, abstractions are good in that they make things easier and faster to do. On the other hand, if we lose sight of how those mechanisms are derived, it's easy to get lost within the abstractions when a problem or error occurs, unable to find or understand the root cause of the problem.

Personally, I feel that if someone is going to be educated in CS (Computer Science) for writing and developing software, they should also know and understand how the hardware works, at least down to the ISA level, at least to the instruction set based on how the logical devices are interconnected. I wouldn't go as far as saying they must know how a transistor works based on its configurations or its physical and chemical properties, yet it wouldn't hurt to know them either. For example: if you apply too much voltage with not enough resistance, and you don't know the conductance, capacitance, or other electrical and electromagnetic properties involved, and you don't know how much heat will be generated before a component exceeds its operational limits, then bad things are going to happen. The component can completely fail; it can arc and short-circuit, damaging other components; it can "melt", catch fire, or explode; it can even cause physical harm.

Here I'm not referring just to everyday, ordinary computing. Consider a guidance system or a fuel-intake system on a commercial jet airliner with, say, 180 souls aboard. If that component fails in the manner described above because poorly written software exploited a flaw in it, pushing it outside its range of stable operation, and the short travels to the fuel tanks and the entire jet explodes, there are 180+ souls lost. This is just one example. Personally, I think we need to stop pushing a bunch of script kiddies out of the universities into the real world, and instead teach every aspect of computer science to those taking up that major to become professionals in the field. Of course, not everyone is going to write software for the airlines, but other industries are just as vital. How about highway traffic control systems? Electrical power plants? Water and reclamation systems? And so much more!

Okay, sure, there are companies with websites that hire people to write and manage their servers, databases, and sites; to that I say it's a different field, in my honest opinion. Web development is not the same as computer science or software engineering. Sure, many topics are shared between the two, such as algorithms, containers, and both time and space complexity. However, there's a difference between the two venues when actual lives are on the line! What if one is tasked with replacing a sensing monitor inside a power transformer while that transformer is in a charged state, and the order of testing the nodes based on phase rotation (three-phase power: ABC vs. ACB vs. BCA, etc.) can make all the difference between life and death while high voltage is still present in the system? We tend to forget that computer science and software aren't bound to the "micro" world; they're coupled with the "macro" scale of things too. Without that power transformer in a good, safe operational state, the buildings on that portion of the grid won't have power, and all your computing power won't mean a damn thing. LOL! So yeah, we shouldn't limit ourselves or our knowledge of things simply because we tend to think they don't apply. They always apply!
@szaszm_ 3 months ago
The modern CPU cache sizes look incorrect around 7 minutes. The beefiest EPYC Genoa with 1G+ L3 cache (9684X) has 6M overall L1 cache (64k/core) and 96M overall L2 cache (1M/core).
@rotors_taker_0h 2 months ago
Yeah, he is definitely confused on that one.
@مقاطعمترجمة-ش8ث 2 months ago
@rotors_taker_0h Outdated info; this happens a lot when you don't keep up with tech developments.
@обычныйчел-я3е 1 month ago
@مقاطعمترجمة-ش8ث It's not outdated, just incorrect.
@srh801 7 days ago
Yeah. It's quite misleading and cavalier to just casually print "L1 - 3M/core" and then confirm it very confidently when asked. Probably used ChatGPT to extract those numbers. Also, L1 cache does not operate in a fraction of a clock cycle; it takes at least 3 cycles.
@GeorgeTsiros 3 months ago
"my program will run faster if i completely disable one of the two sockets on this machine" this is true also for some recent games, where a CPU with 12 cores may run _slower_ than on a CPU with 8 cores, because the CPU with 12 cores is actually two dies (or whatever they are called) with 6 cores each while the 8-core CPU is just one 8-core die.
@LunfardoAR 2 months ago
As a matter of fact, Ryzen Master will disable one die when "gaming mode" is on.
@random_bit 2 months ago
Can confirm, I had to pin a game on my 7950X.
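For anyone wondering what "pinning" looks like in code, here is a hedged Linux sketch using the standard sched_setaffinity API. The assumption that logical CPUs 0-7 sit on a single CCD of a 7950X is just that, an assumption; check the actual topology with lscpu -e first.

    #include <sched.h>   // sched_setaffinity, CPU_ZERO, CPU_SET (glibc; g++ defines _GNU_SOURCE)
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu < 8; ++cpu)
            CPU_SET(cpu, &set);          // logical CPUs 0-7: assumed single-CCD mapping
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // 0 = calling process
            std::perror("sched_setaffinity");
            return 1;
        }
        // ... run the latency-sensitive workload (e.g. the game) from here on ...
        return 0;
    }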
@Roibarkan 4 months ago
1:13:42 Note that if task_count_ gets incremented when a task is added/inserted, and done_task_count_[i] gets incremented when a task is done (by a core in NUMA node i), the Count() function might still wrap around, in case tasks get added and completed between the time task_count_ is read and the time done_task_count_[i] gets read. Because size_t allows wrap-around, it might be better to start the res variable at zero, then subtract all the done_task_count_[i] values (allowing a wrap-around underflow), and only then add task_count_ (with acquire semantics), which should force a wrap back to the right result.
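A minimal C++20 sketch of that subtract-first scheme (the member names follow the slide; the surrounding struct, the node count, and the memory orders are my assumptions):

    #include <atomic>
    #include <cstddef>

    struct TaskCounter {
        static constexpr int kNumaNodes = 4;                        // assumed for the sketch
        std::atomic<std::size_t> task_count_{0};                    // incremented on insert
        std::atomic<std::size_t> done_task_count_[kNumaNodes] = {}; // one counter per NUMA node

        std::size_t Count() const {
            std::size_t res = 0;
            // Subtract completions first; res may wrap below zero here...
            for (int i = 0; i < kNumaNodes; ++i)
                res -= done_task_count_[i].load(std::memory_order_relaxed);
            // ...then read the total last, with acquire semantics: any completion
            // counted above corresponds to an insert already visible here, so the
            // addition wraps res back into a non-negative in-flight count.
            res += task_count_.load(std::memory_order_acquire);
            return res;
        }
    };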
@MatthewWalker0 2 months ago
RE: ~ 1:26:00 page table hardware walking -- it wasn't specifically resolved, but I think what's going on is both: the TLB is a cache for the (in-memory) page table (the hardware will walk it on a TLB miss). *But also*, in the case of the (subsequently mentioned) NUMA page migrations the kernel does have to walk it.
@theIpatix 2 months ago
Yeah, that's what I thought. For normal TLB misses the OS shouldn't have to walk the page tables; that would really tank performance.
@peterprokop 9 days ago
Wrote machine learning code using GPU+CPU on an iPhone. Optimized it so much that it started to slow down after 15 seconds, and a minute later the phone overheated. That was a bit frustrating, because I needed the power.
@treyquattro 2 months ago
It'd be interesting to know which versions of the Linux kernel were involved and whether any of the NUMA slowdown is due to exploit mitigations, and I'd be interested to hear what Linus has to say about it. I'd also like to know whether Windows suffers an impact of the same magnitude given the same or a very similar workload (I presume the programs in question were Linux-specific).
@TheDannyDowling 2 months ago
Was a great talk. Learned a few things watching this.
@Wi8had0w 2 months ago
The topic of heterogeneous CPUs would have been a great addition.
@keyboard_toucher 2 months ago
23:39 Whiscash mentioned
@ArthurGreen-bw3sb 2 months ago
This was very interesting, and good work exploring the performance of different compilers, but it isn't shocking or worthy of a rant that new hardware requires the latest toolchain to get the most out of it. One would hope that packaged distributions will catch up over time (probably not 100%, but near enough).
@MrCOPYPASTE 3 months ago
Hi, can you specify execution use cases (application-wise)? The talk seems to be all over the place in that regard... Some common, granular, real-world examples/samples/explanations would be appreciated... Nice talk, thank you.
@mrnettek 1 month ago
Excellent video! Thanks.
@BoostCon 26 days ago
Your appreciation of Fedor Pikus' presentation is much appreciated!
@heck_fy 1 month ago
a great talk, thank you
@andytroo 2 months ago
29:25 - speculative execution - are there any CPUs that speculate both ways? If you can't predict, it might be cheaper to speculate both ways than to miss 50% of the time...
@denisfedotov6954 2 months ago
To the best of my knowledge, there aren't. It's probably impractical to do work half of which would be discarded anyway. Instead, CPUs focus on executing the predicted branch speculatively. It is worth noting that speculative execution is based on the so-called reorder buffer, which enables instructions to retire in their original order.
@ratgr 2 months ago
It's a nice idea, but then you need N levels of speculation and a way to choose when to follow both forks, since most of the time you are already inside a speculated branch. Just 3 levels of depth would need 8 times the speculation resources...
@mytech6779 2 months ago
The Zen 5 branch predictor can "...do two taken and two ahead branches". It also has dual instruction decoders per core for better SMT, but a single thread can use both when SMT is off.
@meetem7374 2 months ago
Hello, thank you for a good talk! I have one question though: the prefetcher hides latency by streaming memory up front, but how does it handle page faults, and how does it perform when virtual memory is contiguous but the physical pages backing the virtual range are not? For instance, say we have a 2MB contiguous virtual range where each 4K page is mapped to a different random physical page.
@mytech6779 2 months ago
(This is for a single-CPU machine; NUMA has some side effects, as indicated in the video.) Prefetching is always based on previous read instructions, so a page fault must be resolved for those instructions without regard for any prefetching that follows. (The "pre-" part of the name is a bit deceiving, because it happens after a read instruction, as an extrapolation.)

All memory reads depend on the alignment of data to physical memory lines (usually 512-bit boundaries, the same as a cache line, on AMD64 machines), because that aligned block is what gets transferred over the bus even if you only want one bit; this is overfetching, and it is useful as a form of prefetching.

Recent memory controllers do page-aware spatial cache prefetching, which does not prefetch across pages; they basically take the requested line plus one. There are also stream and stride prefetchers that spy on the memory-address access patterns in cache and prefetch based on an extension of the pattern (I don't know whether these are aware of virtual-page address space or otherwise follow the page mapping). Normally they only prefetch a small number of lines, in the range of 1-4 depending on the pattern matching involved, because this is a guess in the dark and the prefetched data consumes both bus bandwidth and L1/L2 cache space when it may never be used. (Possibly L3 space, but on Zen the L3 is a victim cache that only contains L2 evictions, never direct memory fetches.)

With 4K pages being 64 × 512-bit lines, the penalty for page boundaries may be around 5% on workloads where prefetching is important. As far as I know, spatial-prefetch modules that handle larger or variable page sizes are not yet used on production CPUs, but they are being proposed and tested. To be clear, spatial prefetching is different from looking ahead in the pipeline for an explicit instruction to read a block of contiguous memory (which may cross pages).
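As an aside, explicit software prefetching is one way around the hardware spatial prefetcher stopping at page boundaries. A hedged GCC/Clang sketch; the look-ahead distance here is a tuning guess, not a rule:

    #include <cstddef>

    // Issue a software prefetch a fixed distance ahead of the current read.
    // Unlike the hardware spatial prefetcher, __builtin_prefetch is only a
    // hint and may target any address, including one on the next page.
    double sum_with_prefetch(const double* data, std::size_t n) {
        constexpr std::size_t kAhead = 16;  // elements ahead; assumed tuning value
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + kAhead < n)
                __builtin_prefetch(&data[i + kAhead]);
            sum += data[i];
        }
        return sum;
    }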
@neonmidnight6264 1 month ago
@mytech6779 What a great response. Thank you.
@dascandy 2 months ago
@23:00 you have j in the left slide in the top row where it should read i
@surters 3 months ago
"There is no hardware assist for TLB updates" that is catastrophic. Having to run software to update it :(
@AK-vx4dy 3 months ago
I'm not sure about Linux, but x86 can walk the page tables in hardware without involving the kernel.
@mytech6779 2 months ago
Sort of; the page table is still stored in, and walked through, main memory, even though a dedicated hardware unit does the walking. That makes the speed approximately the same, because the memory bus is the limiting component.
@llothar68 1 month ago
It's of course CPU BIOS code. Turning off paging would make the CPU double the speed.
@leshommesdupilly 2 months ago
std::list :3
@theevilcottonball 2 months ago
23:17 bad for loop index copy-pasta
@heinzhaberle3326 1 month ago
This video may also help in understanding caches and shared-state concurrency: kzbin.info/www/bejne/mIu5eah8bLikrc0
@erenpalabiyik5054 2 months ago
Where's the C++?
@JarppaGuru 1 month ago
No need to optimize code we already know; we can change how it compiles. Doh! It's just machine code. Different languages don't make any sense; the end product is the same: machine code. Why reinvent the wheel and create a new language that looks the same as print()?
@raymundhofmann7661 2 months ago
Isn't this mostly the job of the compiler?
@aaaabbbbbcccccc 2 months ago
Compilers aren't magic. You should know more about your program than can reasonably be expressed in any high-level language.

To give an example of code I wrote earlier today: I have an array of structs, and this array is a priority queue. I know that I will always pop the front off the queue, and then immediately insert a new element into the queue. Using something like std::priority_queue is too slow, as it doesn't know this - it considers pop() and push() to be totally independent of one another. So I can write a simple array structure and handle it myself, with a for loop, and just write this:

    for (u64 i = 0; i < length_of_array; i++) {
        array[i] = array[i + 1];        // shift left: pops the front
        if (new_element < array[i]) {   // found the insertion point
            array[i] = new_element;
            break;
        }
    }

But the problem is going to be the branch, and it'll be very hard to predict. I can use some binary search, but it'll be slower for small values of length_of_array, and it will still require comparisons and branching. An alternative approach is for me to do what he's describing - conditional moves. But there's another optimisation I can also do, which is to have the array be longer than I need it to be, so that the array length is the right size for me to be able to load multiple elements with a single instruction using SIMD (so vmovdqu ymm0, [arr]) without having to worry about going off the end of the array. How do I make sure my comparison logic works? By ensuring that the values I insert at the end of the array are always going to be larger than any realistic input I will give for a 'real' value.

The compiler can never know this. I tried for hours to get it to generate some efficient codegen, but in the end had to write it myself in assembly, because the compiler simply doesn't know enough about what my program is going to do, and it never can.

Does this matter? Not in most cases. But if you want to write performant code, you can't just leave it up to a compiler. Compilers produce fast codegen for you to use as a *starting point*, not as a final result. Most of the time that starting point is good enough, but you can always improve on it in the critical sections.
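A branchless sketch of that conditional-move idea, under the same oversized-array-and-large-sentinel assumptions; it writes every slot instead of breaking early, trading extra stores for a predictable instruction stream:

    using u64 = unsigned long long;

    // Pop array[0] and insert new_element, keeping ascending order. Relies on
    // array[length_of_array..] holding sentinels larger than any real input.
    void pop_front_and_insert(u64* array, u64 length_of_array, u64 new_element) {
        for (u64 i = 0; i < length_of_array; i++) {
            u64 shifted = array[i + 1];                     // element shifted in from the right
            bool take_new = new_element < shifted;          // one compare; cmov-friendly
            array[i] = take_new ? new_element : shifted;    // keep the smaller value here
            new_element = take_new ? shifted : new_element; // carry the larger one rightward
        }
    }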
@mytech6779 2 months ago
The compiler can only optimize optimizable code. That's why you will hear people say "do not pessimize your code": while you shouldn't waste time on premature optimization, some basic structures are architecturally bad and will stop the compiler from being able to optimize, or are so inherently ill-suited to the task that they slow the program below the basic, simplistic way of writing the code. For example, reversing the nesting of loops when iterating over the columns and rows of a 2D array: one order is very cache friendly and the other completely breaks caching (see the sketch below). The compiler's optimizer can't know what you intended when you wrote it one way or the other; all it sees is the assembly and a 1D array of addresses being accessed locally in a stride pattern (it can't know whether the data will be accessed in a different pattern by some other app or compilation unit).
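A minimal sketch of that loop-order point, assuming row-major storage: both functions compute the same sum, but the first walks memory contiguously while the second strides by a whole row per access.

    #include <cstddef>
    #include <vector>

    // Cache-friendly: the inner loop walks consecutive addresses.
    double sum_row_major(const std::vector<double>& m, std::size_t n) {
        double s = 0.0;
        for (std::size_t r = 0; r < n; ++r)
            for (std::size_t c = 0; c < n; ++c)
                s += m[r * n + c];
        return s;
    }

    // Cache-hostile: the inner loop strides by n doubles, touching a new
    // cache line (and eventually a new page) on every access.
    double sum_col_major(const std::vector<double>& m, std::size_t n) {
        double s = 0.0;
        for (std::size_t c = 0; c < n; ++c)
            for (std::size_t r = 0; r < n; ++r)
                s += m[r * n + c];
        return s;
    }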
@geoffreymak000 2 months ago
@aaaabbbbbcccccc I love your comment. And there are people saying AI is magic and will replace all human coders in two years.
@runninggames771 1 month ago
How were you able to acquire all this knowledge?
@pahom2 2 months ago
5:35 the guy thinks we have 3MB of L1 per core. This is nuts; everything he says about cache optimizations doesn't hold. What's crazier, he was asked specifically about the 3MB-per-core number at the 7:20 mark and insisted on it. He doesn't know basic things about CPU architectures. Come on!
@quantum_dongle 2 months ago
He specifically says that was the largest L1 / lowest-level cache he could find on the market. Either way, as the size of L3 grows, L2 and L1 are going to have to grow in turn. I would not be surprised if consumer chips started having 0.125-0.25 MiB of L1 per core very soon.
@DarkReaper10 1 month ago
Some webpages sum up the L1 and L2 caches to give a cumulative number rather than a per core number even though they are not shared. The biggest Genoa has 96 cores at 32KB L1 and 1MB L2 per core which results in 3MB L1 cache total and 96MB L2 cache total. That's probably where he made the mistake. Btw, it's probably debatable whether the AMD L3 cache should be summed up per socket or per CCD, giving either 32MB L3 per CCD or 384MB L3 per socket, as the inter-CCD latencies are quite large and sharing is inefficient.
@srh801 7 days ago
💯 I dunno why people come to its defense with "maybe in the future we will have a larger L1 cache". We already have a larger L1 cache; it's called the L2 cache. Both are based on 6-transistor SRAM cells. The only reason L2 is an order of magnitude slower than L1 is that it's larger, so it's farther away, and addressing it needs more time because it is, well, large. So if L1 becomes larger, either it will tend toward L2 speed, or they will have broken 2nm or found some brand-new paradigm able to fit more transistors in the same space as 32K. Just accept that he is wrong; he repeats the same talk every year with new sprinkles.
@razshneider2375 3 months ago
14:45 so functional programming is slow? Piping multiple functions on iterators one after another is slower than a single loop manipulating the data? I must preach against FP to my friends now
@treelibrarian7618 2 months ago
The speaker is wrong about quite a few things, and this is one of them, but it depends on the loop. See my much longer comment for details.
@thebatchicle3429 2 months ago
FP has always been fundamentally slow and inefficient.
@gabrielb.2886 2 months ago
FP per se is not slow, although it can be in certain cases for no reason apparent from reading the code. FP, due to its abstract nature, gives you less control over the way computation is performed. At the same time, given that FP describes computation at a coarser grain than C-style programming, the compiler can reason about the code at a higher level and perform broader optimizations that are harder to apply to lower-level languages.

For instance, given a collection of elements C and two mappings f and g applied to C: C.map(f).map(g). This expression alone does not tell you whether the calculation will be executed as C.map(f.composed_with(g)) or as (C.map(f)).map(g). The former suggests that both functions are applied to each element before stepping to the next element of C, while the latter applies f to all elements of C before applying g to the new, resulting collection. Note that these approaches relate in some ways to lazy vs. eager evaluation.

There are pros and cons to each approach. The first does not require saving the intermediate collection to memory before applying g. The second does not require keeping both f's and g's instructions in cache (together with all the predictor state needed to accelerate this code). At that point, you are facing a compromise: would you rather save on memory footprint or on computation? This is a difficult choice for a compiler, especially if it does not know the size of the collections. However, because the compiler knows that operations can be performed in any order, it could decide to compute, say, a thousand elements of the intermediate collection at a time, then transform those elements into the final collection and free the intermediate buffer. This approach provides both good code and data locality with a bounded memory overhead. I have no idea whether such an optimization is implemented natively by any FP language, though.

Another source of slowness for FP is immutability. Again, it will depend on the ability of the compiler to optimize things and use mutability (i.e., storage reuse) behind the scenes when it yields equivalent results. Such an optimization could be hard to perform safely, and I don't know whether it is implemented by any language/compiler.
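To make the fusion point concrete in the language of the talk, a minimal C++20 sketch (f and g are placeholder lambdas): std::views::transform composes lazily, so the piped version applies g(f(x)) element by element with no intermediate collection, while the eager version materializes the intermediate vector.

    #include <ranges>
    #include <vector>

    int main() {
        std::vector<int> c = {1, 2, 3, 4};
        auto f = [](int x) { return x + 1; };
        auto g = [](int x) { return x * 2; };

        // Fused / lazy: each element flows through f then g before the
        // next element is touched; no intermediate collection exists.
        long fused = 0;
        for (int y : c | std::views::transform(f) | std::views::transform(g))
            fused += y;

        // Eager: apply f to the whole collection first, storing the
        // intermediate result, then apply g in a second pass.
        std::vector<int> tmp;
        tmp.reserve(c.size());
        for (int x : c) tmp.push_back(f(x));
        long eager = 0;
        for (int x : tmp) eager += g(x);

        return fused == eager ? 0 : 1;  // same result, different evaluation order
    }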
@mytech6779 2 months ago
FP is a high-level abstraction for the theoretical organization of your logic. Such programming paradigms are completely irrelevant to the physical limits of hardware computation engines.
@ХузинТимур 1 month ago
If you use Rust, it can compile all your iterator transformations to a loop with function calls, then inline them.
@philippeastier7657 2 months ago
Excellent deep dive. Unfortunately, it focused exclusively on the x86 architecture, which is not modern anymore. ARM cores currently lead in performance per core and per watt, and there is no real way x86 will be able to compete in the future, besides adding more complexity and more power usage...
@SkegAudio 1 month ago
How long do you foresee it taking, if it happens at all, for ARM to have the largest market share in consumer tech?
@ABaumstumpf 1 month ago
"which is not modern anymore." That is just wrong - on every single perceivable level.
@srh801 7 days ago
No one cares about "per watt" in a backend data center, just performance per core. Accept that there are different workloads (devices, servers, ML, etc.) rather than touting your use case as the only one that matters.
@tomaszszupryczynski5453 2 months ago
All of this is bullshit, because a modern-day CPU isn't a CPU like in the old days, all that talk about pipelines. RISC had a 7-stage instruction pipeline even in 1997, when MMX had only 2. Today's CPU is a RISC core that emulates CISC with microcode that generates code for the RISC core. That's why we can access the same data: the microcode generates code to run that predicts what we will do. And that's why we have so many code-injection exploits today. Imagine what performance we would have if we could remove the microcode and run code directly on the RISC core, like on ARM without bytecode.
@ABaumstumpf 1 month ago
"if we could remove microcode and run code directly on risc" Sure - way worse - like orders of magnitude slower.