Parallel C++: SIMD Intrinsics

Views: 5,632

Nick

Days ago

Comments: 15
@markusbuchholz3518 a year ago
Hi Nick, I do not have a question, but I would like to highlight again that your channel is remarkable. As far as I know, only by following your channel can one keep up in a consistent way with the latest achievements in software, especially in excellent C++. It is a huge distinction to be here. Additionally, I appreciate the effort you put into preparing a video each day. I have been using C++ for over two decades, mostly within the robotics domain. Your impressive work gives all of us (the community) a new look at this beauty and encourages us to study more. Thank you so much. Have a nice day!
@NotesByNick a year ago
Thank you for the kind words! Always nice to hear when others enjoy the content. Autonomous robotics was where I got my start in research many years ago (primarily with SLAM) before moving more into the architecture/performance side of things. Cheers, --Nick
@user-vw5ex4wf1e a year ago
Hey Nick, amazing videos as always! Compiling with -ffast-math seems to unlock intrinsics for the transform_reduce baseline as well. Btw your videos are very inspirational, keep it up!
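For readers following along, here is a minimal sketch of what a `transform_reduce` dot-product baseline might look like. The video's actual code is not reproduced in this thread, so the function name `dot_baseline` and its signature are assumptions made for illustration. Built with something like `g++ -O3 -march=native`, GCC and Clang generally keep the floating-point additions in order; adding `-ffast-math` permits reassociation, which is what allows the reduction to be vectorized as the comment above describes.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical baseline (not the video's code): dot product of two
// equal-length float vectors using std::transform_reduce (C++17).
float dot_baseline(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    // multiplies<> forms the element-wise products, plus<> sums them,
    // starting from the initial value 0.0f.
    return std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0f,
                                 std::plus<>{}, std::multiplies<>{});
}
```

Comparing the generated assembly with and without `-ffast-math` (for example on Compiler Explorer) is a quick way to check whether packed instructions appear on a given compiler version.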
@shaytal100 a year ago
The performance benefit of using SIMD intrinsics is really impressive! I wonder how often SIMD instructions could speed up everyday computing tasks. My really blind guess would be that they are very underused, even in compute-intensive software. Thanks, Nick, for this fantastic series so far!
@NotesByNick a year ago
Glad you enjoyed it! For many cases, the auto-vectorizer code is good enough. There can be an incredibly high software development cost for using low-level intrinsics (programming in assembly can be tricky work). There are great examples though of code written entirely/almost entirely in assembly. Intel's MKL (math kernel library) is a great example (along with many high-performance linear algebra libraries out there). Cheers, --Nick
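To give a sense of the development cost mentioned above, here is a small sketch of a hand-vectorized dot product. It is not the code from the video: it swaps in 128-bit SSE intrinsics (`_mm_loadu_ps`, `_mm_mul_ps`, `_mm_add_ps`) because they are available on every x86-64 target, and the name `dot_simd` is hypothetical. On targets without SSE, the guard falls back to the plain scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <immintrin.h>
#endif

// Hypothetical hand-vectorized dot product (illustrative sketch only).
float dot_simd(const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    float total = 0.0f;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();  // four partial sums, one per lane
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned loads: no alignment assumed
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    // Horizontal reduction of the four lanes.
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    // Scalar tail (or the whole array when SSE is unavailable).
    for (; i < n; ++i) total += a[i] * b[i];
    return total;
}
```

Even this small function needs a separate tail loop, an explicit horizontal reduction, and a portability guard, which is part of why library implementations are usually the better choice when they exist.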
@shaytal100 a year ago
@NotesByNick Well, that is true. I guess there are many libraries for common algorithms that make good use of SIMD instructions. Good point!
@eladon19153 6 months ago
Hi Nick, I just read about alignment, and I would like to know why it is an improvement to align at 32 and not 64. As I understand it, 64-byte alignment (on a 64-bit system) would mean a worst case of 4 cache misses and a read of 64 bytes, while 32-byte alignment would mean a worst case of 6 cache misses, unless we are talking about a 32-bit system. Again, I might be wrong in how I understand the cache, but I figured I would just ask while I am still reading about it. Thanks a lot.
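As a point of reference for this question, here is a small sketch relating vector width to cache-line size. It assumes 256-bit (32-byte) AVX registers and a 64-byte cache line, which is typical on x86 but not guaranteed; all names are hypothetical. Under those assumptions, a 32-byte load from a 32-byte-aligned address always stays inside a single cache line, so 64-byte alignment is not required to avoid split loads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorBytes = 32;     // one 256-bit AVX register
constexpr std::size_t kCacheLineBytes = 64;  // assumed cache-line size

// True if a load of `bytes` bytes starting at address `addr`
// touches more than one cache line.
constexpr bool crosses_cache_line(std::uintptr_t addr,
                                  std::size_t bytes = kVectorBytes) {
    return addr / kCacheLineBytes != (addr + bytes - 1) / kCacheLineBytes;
}

// True if pointer `p` is aligned to `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical 8-float block whose storage is 32-byte aligned.
struct alignas(32) Float8 {
    float lanes[8];
};
```

For example, `crosses_cache_line(32)` is false while `crosses_cache_line(48)` is true, and any `Float8` object satisfies `is_aligned(&obj, 32)`.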
@juancolmenares6185 3 months ago
Why was it that the compiler did not recognize that it could use the vdpps instruction? You did mention something about the compiler implementation, but a dot product seems like something it should be able to figure out...
@NotesByNick 3 months ago
If I recall correctly, it's because of the compiler's guarantees about floating-point arithmetic. Compilers guarantee floating-point results (regarding the ordering and precision being used) to give repeatable results across platforms. The vector dot product, I believe, uses higher precision for intermediate operations and only rounds the final result, therefore giving a different result than if you were to do the dot product in the standard, floating-point-standard-compliant way. That result will often be more accurate than the standard calculation, but it is non-portable (because intrinsics are hardware specific).
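The ordering guarantee mentioned in this reply can be illustrated with a short sketch. Floating-point addition is not associative, so a sum accumulated in four independent lanes (the way a SIMD reduction works) can round differently from a strictly sequential sum. The code below emulates the lanes in scalar code; both function names are hypothetical, and it assumes IEEE 754 single precision without `-ffast-math`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Strictly left-to-right sum, as written in a plain scalar loop.
float sum_sequential(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Emulates a 4-wide SIMD reduction: element i goes to lane i % 4,
// and the four partial sums are combined only at the end.
float sum_four_lanes(const std::vector<float>& v) {
    std::array<float, 4> lane{};  // zero-initialized partial sums
    for (std::size_t i = 0; i < v.size(); ++i) lane[i % 4] += v[i];
    return ((lane[0] + lane[1]) + lane[2]) + lane[3];
}
```

For the input `{1e8f, 1, 1, 1, -1e8f, 1, 1, 1}`, `sum_sequential` returns 3 because each of the first three 1s is lost when added to 1e8, while `sum_four_lanes` returns 6, which is the exact answer. Because the two orderings can disagree, a compiler is not allowed to switch from one to the other unless flags such as `-ffast-math` grant that permission.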
@juancolmenares6185 3 months ago
@NotesByNick Great, thank you for the explanation and the content!
@ahmedazeem5975 a year ago
Is it possible for you to also cover Arm Neon intrinsics? This is a good topic and a good video :)
@NotesByNick a year ago
Thanks for the suggestion, and glad you enjoyed the video :^) I would like to do more ARM-based performance videos, but unfortunately, I don't have an ARM proc at this moment, so it's a non-starter until that changes. Cheers, --Nick
@anm3037 a year ago
It's unfortunate that SIMD doesn't fit so many practical scenarios.
@SneedsFeeduckAndSeeduck 8 months ago
SIMD should be designed like a GPU kernel, to execute one stream of instructions on arbitrary amounts of independent data, instead of simply making a normal program that operates on 4 or 8 or 16 values at once. It just doesn't lend itself well to arbitrary processing the way it is currently common.
@Kaassap 6 days ago
twohundredandfiftysix