The fact that the current high-end gaming Intel CPU (14900K) does not support AVX512 is insane.
@HandSwamp-c7r 1 month ago
It's even more insane that there are plenty of older non-AVX CPUs that can run the most demanding games on the highest settings. A good example is Cyberpunk 2077. At first the game would only run on AVX-capable CPUs; after some time they removed the AVX check. And guess what? With my 6-core/12-thread Xeon X5675 running at 5 GHz, with no AVX, I can play the game with all settings maxed at a perfect 144 fps. So I really don't see why they don't just make these games compatible with non-AVX CPUs as well, because it has been proven more than once that all these games can be played on non-AVX CPUs. Right now I'm playing the Silent Hill 2 remake, which also required AVX, but thanks to some smart guy who made a non-AVX tool, the game is playable on non-AVX CPUs. And again, with all settings maxed on my non-AVX CPU, it runs like a charm.
@mika2666 23 days ago
Intel's desktop P-cores have had AVX-512 support since 11th gen, but it's been disabled since 12th gen because the E-cores don't support it :(
@MarekKnapek 1 year ago
Recently, I implemented the Serpent symmetric block cipher (an AES candidate) in portable C using 32-bit unsigned integers. Then I ported it to 128-bit SSE2 for 4× the performance, and later to 256-bit AVX2 for an additional 2× speedup on top of SSE2. That thing scales like magic. I don't have an AVX-512-integer-capable computer. Reading Intel's intrinsics docs isn't nearly as difficult as I initially feared.
@MrHav1k 4 months ago
This was so well explained! Thank you!! The CoD zombies clips in the background actually helped a ton with following along on the video haha. Great stuff!
@nate6908 1 year ago
How is AVX-512 used for inference? From what I understood in this video, AVX-512 lets you execute a multiply (or accumulate) instruction on eight double-precision floats (8×64 = 512, thus the name). So could you quantize models to int8 and then execute 64 int8 values with one single instruction, instead of decoding the same instruction 64 times? The company Neural Magic even goes as far as saying CPU inference is the way forward. But even with "64-wide SIMD", I thought GPUs are still much more parallel.
@problematic3255 8 months ago
I... wh... okay, THIS is how my friends feel when I start talking about various CPUs, GPUs, and other hardware... this whole video felt like I was listening to an entirely different language. Am I just stupid or something for not being able to pick up on context clues?
@ProceuTech 8 months ago
You're not stupid: this isn't common-sense information, and a lot of it builds on other information, so without that background there are holes in understanding. Kind of my fault for not explaining things well enough
@problematic3255 8 months ago
@@ProceuTech I mean, I know AVX-512 starts with 11th gen and then disappears and/or is replaced by a new thing in 12th gen when looking at instruction sets, but that's as far as my knowledge goes lol. I should've looked up instruction-set basics videos before looking up the new new stuff lol
@ProceuTech 8 months ago
AVX-512 is a weird one, because you're right: Intel didn't support it officially on 12th Gen (though it was still active in hardware in early revisions available to the public), and it's been entirely fused off in 13th and 14th Gen. The next generation of AVX, called AVX10, aims to fix this by letting programmers still use AVX-512 code, but "double pumping" it through an AVX2-width (256-bit) hardware data path, similar to what AMD does with Zen 4. Weird stuff, but you're not stupid for not understanding!
@ChrisM541 1 year ago
You have to understand that, in its most basic form, shifting from an 8-bit to a 16-bit CPU carries an automatic 'SIMD' upgrade for all the widened registers, for rather obvious reasons. With today's 64-bit CPUs, adding separate large registers and the applicable opcodes (opcodes which have become more complex/powerful) can, and does, have the effect of stalling a general move to greater-than-64-bit CPUs. Today we're merely extending hybrid architectures, and today's 'large register' extensions are our means of doing that.
@yumenokoyume 9 months ago
I'm no programmer, but I wonder what happens if you run an AVX2 program on a processor that doesn't support it, like an Intel i5-3470.
@ProceuTech 9 months ago
It will hit an invalid-opcode fault the moment the first AVX2 instruction executes; the OS reports that as an illegal instruction (SIGILL, not a segfault) and the program crashes :(
@yumenokoyume 9 months ago
@@ProceuTech Thanks for the reply. I'm using a 3rd-gen i5 for my video editing and VFX, but the Adobe 2024 installers won't let me install any further because AVX2 is not supported. I'm just kinda curious what would happen if I ran the program. 🤣🤣
@ferna2294 7 months ago
@@yumenokoyume Usually they program their apps with some fallback path when we're talking about the LATEST tech, so someone with hardware a couple of generations older can also use the app. However, since it's been more than 10 years since AVX2 was standardized, Adobe probably doesn't care about backwards compatibility anymore.
@yumenokoyume 4 months ago
@@ferna2294 Yeah, so true. But it's understandable, since new tech is getting so good that supporting old hardware isn't profitable anymore.
@cdriper 1 year ago
vector::at in a high-performance loop? )
@ProceuTech 1 year ago
You could also theoretically do vector.data() + i (pointer arithmetic already scales by sizeof(int32_t), so you don't multiply by it yourself).
@cdriper 1 year ago
@@ProceuTech vector::at validates the index on each access; vector::operator[] doesn't. (Pass the vector by reference to simplify access through operator[]; moreover, prefer passing by reference when a null invariant is not expected.) But yeah, the more important point here is that without good optimization each indexed access to an array means computing "offset + index*sizeof(element)". Also, it's not a good idea to put a condition inside a loop, because then performance depends on yet another optimization: branch prediction inside the CPU.
@dennysgrimaldi9623 1 year ago
Nice video, deserves more views
@arnoldn2017 21 days ago
I think that AVX-512 is overrated for floating-point ops. The code overhead required to shuffle data around in the ZMM registers absorbs a lot of the efficiency gain from the actual _mm512 intrinsics. Against an optimized C++ program, I squeeze out a 15% speed gain at most writing in assembly; using intrinsics the gain is slightly less.
@ProceuTech 20 days ago
It's very hard to disagree. Most compilers in 2024 are good enough that you don't need to write these intrinsics in your code yourself; vectorization is a flag you can turn on and off in the build config/settings.
@MZRFaith 11 months ago
Most emulation requires AVX-512 to run stably. Intel's CPUs have been trash in performance to me; the 5600X I have is way better.
@Adamchevy 11 months ago
This is the sole reason I haven't upgraded from my 11900K. Why Intel went backwards on this I will never understand. I guess emulation isn't something they care about.
@CaptainScorpio24 9 months ago
@@Adamchevy My i7-12700 non-K has AVX-512 😊
@Adamchevy 9 months ago
@@CaptainScorpio24 The early ones do, but the later ones do not, and it isn't in 13th or 14th gen. Of course, with emulation under attack, it might not be that important a year from now. But when you use RPCS3 it makes a huge difference.
@stevensv4864 8 months ago
Bro, the 5600X IS trash compared to 12th, 13th, and 14th gen Intel, even without AVX-512 😂
@stevensv4864 8 months ago
@@CaptainScorpio24 Can you test God of War 3 with the same settings as my videos?
@WallisGabriel-r6p 2 months ago
Alberto Fort
@youtubeshadowbannedmylasta2629 1 year ago
and it just makes performance worse.
@nidalspam509 11 months ago
Not in the case of Zen 4 from AMD.
@MrKatoriz 7 months ago
Intel's garbage nodes, which can't help but melt upon seeing an AVX-512 instruction, are the reason (the entire CPU downclocks for a significant amount of time as soon as an AVX-512 instruction is executed).
@panjak323 5 months ago
@@nidalspam509 It makes performance the same as with AVX2, because the functional units are only 256-bit, so the workload is split into 2× 256-bit operations. It can be argued that full 512-bit FUs running at 70% clock speed are still better than 256-bit FUs at 100% clock speed.