C++ Weekly - Ep 460 - Why is GCC Better Than Clang?

24,321 views

C++ Weekly With Jason Turner

1 day ago

☟☟ Awesome T-Shirts! Sponsors! Books! ☟☟
Upcoming Workshops:
► C++ Best Practices Workshop, ACCU, Bristol UK, Mar 31, 2025: accuconference...
A robust cross-platform IDE for C and C++, CLion makes your development experience smoother and more productive! The latest version of CLion introduces deep integration with the ReSharper C++/Rider C++ language engine, performance improvements for the debugger, new features for embedded development, enhancements for project models and build tools, and much more! Check out all of the details and give it a try today!
jb.gg/clion_ide code: CppWeeklyCLion
Episode details: github.com/lef...
Stats: docs.google.co...
CE Diff: compiler-explo...
Episode 435 - Easy GPU Programming With AdaptiveCpp (68x Faster!) - • C++ Weekly - Ep 435 - ...
T-SHIRTS AVAILABLE!
► The best C++ T-Shirts anywhere! my-store-d16a2...
WANT MORE JASON?
► My Training Classes: emptycrate.com/...
► Follow me on twitter: / lefticus
SUPPORT THE CHANNEL
► Patreon: / lefticus
► Github Sponsors: github.com/spo...
► Paypal Donation: www.paypal.com...
GET INVOLVED
► Video Idea List: github.com/lef...
JASON'S BOOKS
► C++23 Best Practices
Amazon Paperback: amzn.to/47MEAhj
Leanpub Ebook: leanpub.com/cp...
► C++ Best Practices
Amazon Paperback: amzn.to/3wpAU3Z
Leanpub Ebook: leanpub.com/cp...
JASON'S PUZZLE BOOKS
► Object Lifetime Puzzlers Book 1
Amazon Paperback: amzn.to/3g6Ervj
Leanpub Ebook: leanpub.com/ob...
► Object Lifetime Puzzlers Book 2
Amazon Paperback: amzn.to/3whdUDU
Leanpub Ebook: leanpub.com/ob...
► Object Lifetime Puzzlers Book 3
Leanpub Ebook: leanpub.com/ob...
► Copy and Reference Puzzlers Book 1
Amazon Paperback: amzn.to/3g7ZVb9
Leanpub Ebook: leanpub.com/co...
► Copy and Reference Puzzlers Book 2
Amazon Paperback: amzn.to/3X1LOIx
Leanpub Ebook: leanpub.com/co...
► Copy and Reference Puzzlers Book 3
Leanpub Ebook: leanpub.com/co...
► OpCode Puzzlers Book 1
Amazon Paperback: amzn.to/3KCNJg6
Leanpub Ebook: leanpub.com/op...
RECOMMENDED BOOKS
► Bjarne Stroustrup's A Tour of C++ (now with C++20/23!): amzn.to/3X4Wypr
AWESOME PROJECTS
► The C++ Starter Project - Gets you started with Best Practices Quickly - github.com/cpp...
► C++ Best Practices Forkable Coding Standards - github.com/cpp...
O'Reilly VIDEOS
► Inheritance and Polymorphism in C++ - www.oreilly.co...
► Learning C++ Best Practices - www.oreilly.co...

Comments: 96
@cppweekly (19 days ago)
Thank you everyone for the great discussion. There are some outstanding responses below about using VTune and related tools. The answer seems to be my modulus function. You can also see more discussion on the related GitHub issue: github.com/lefticus/cpp_weekly/issues/417
@PaulTopping1 (1 month ago)
This should unleash the Clang experts on the problem. Looking forward to the eventual episode uncovering the truth.
@rossborden1636 (1 month ago)
Off the top of my head, this reminds me of something the prof in my assembly language courses at university said: dealing with whole registers is faster for the CPU than partial registers, so use whole registers unless you really need the space. So if your cache is big enough that 64-bit values aren't causing misses, the extra work of dealing with smaller values isn't worth it.
@ajjr1228 (1 month ago)
SURELY the cycle time of accessing rax vs. ax is *no different*, nor even of doing ops on them.
@xavierthomas1980 (1 month ago)
That was my guess. Given that the memory/cache used is small in both cases, the masking and bit shifting needed to handle integers smaller than the architecture's register length are probably the culprit. But why on one compiler and not the other? And why at a specific optimization level? Also, obviously, with large memory/cache usage the cache misses would have a far greater cost than those masking and bit-shifting operations, so this is probably specific to this use case and not something true in general.
@kuhluhOG (1 month ago)
@@ajjr1228 So, I did some testing with assembly.

whole.s:

    .global _start
    .text
    _start:
        mov $0, %rax
    loop:
        add $1, %rax
        cmp $1000000000, %rax
        jb loop
        mov $60, %rax
        syscall

small.s:

    .global _start
    .text
    _start:
        mov $0, %eax
    loop:
        add $1, %eax
        cmp $1000000000, %eax
        jb loop
        mov $60, %rax
        syscall

As you can see, the only difference is that the loop uses eax in small.s and rax in whole.s. I created both programs (small and whole respectively) with:

    gcc -c small.s && ld -o small small.o

I then executed both multiple times like this:

    time for i in 1..10000; do ./small; done

I needed to do it in a (bash) loop to actually get an observable number. While I am not sure how representative it is because of that, I regularly get (executing them often in fast succession) half a second overall for small and a quarter of a second for whole. So I do think it makes a difference. But on the other hand, I needed an extreme example to be able to show it at all, to the point where I think it will rarely make a difference. Nonetheless, good to know in case it does.
@simonfarre4907 (1 month ago)
@@ajjr1228 I am speculating absolutely wildly here, but CPUs have had instruction-level parallelism (ILP) for decades. Using the full register and being able to discard its contents, abstractly speaking, could theoretically mean that it uses less bandwidth. This is pure speculation. Haven't even watched the video yet.
@dexterman6361 (1 month ago)
I kinda assumed the same, yeah. int32 and int64 have nice SIMD intrinsics; not so sure about support for int8.
@giuseppecesarano108 (1 month ago)
Hi Jason, first of all, thank you for the intriguing problem and for all the knowledge you share for free!

I've run both versions under VTune and toplev, and they are similar. The only thing that changes is the number of instructions retired, and for some reason toplev in some runs also reported high fetch latency due to branch resteers but with low branch mispredicts, which is weird.

Getting to the bottom of why the fast version had fewer instructions wasn't easy, and in the end the key was Clang's -Rpass-missed='.*'. Running the build command with that on my machine, with Clang version 18.1.8, the slow version has more "vectorization was impossible" messages. The slow version doesn't get vectorized in two points where the fast one does:

    slow.cpp:79:43: remark: Cannot SLP vectorize list: vectorization was impossible with available vectorization factors [-Rpass-missed=slp-vectorizer]
       79 | return ((dividend % divisor) + divisor) % divisor;

    slow.cpp:131:63: remark: Cannot SLP vectorize list: vectorization was impossible with available vectorization factors [-Rpass-missed=slp-vectorizer]
      131 | return static_cast(floor_modulo(p.y, height) * width +

Clang also points to the exact operations that fail to be vectorized: the second modulo in the first message and the multiplication in the second message.

Now that I had a good starting point, I went on Compiler Explorer and looked at the difference in the generated assembly for the % operator, and as expected the fast version looks cleaner. The second modulo somehow gets "embedded" in the general computation, whereas the slow version generates five more instructions, one of which is even a conditional move. Considering that integer modulo operations are kind of hellish for CPUs, I can easily imagine that they truly are the bottleneck.

The last question that remains unanswered is why Clang can perform vectorization on the larger integer but not on the smaller one. But to be honest, I'm not an LLVM engineer, and this question is way over my head.
@nullbox02 (1 month ago)
If SLP can vectorize the code for larger ints but cannot do it for the small ints, the problem is definitely in the X86 cost model. The SLP vectorizer can detect vectorizable patterns but considers them non-profitable to vectorize. LLVM's TTI X86 arithmetic cost estimations for smaller types must be adjusted.
@nullbox02 (1 month ago)
Also, you can try playing with -mllvm -slp-threshold=-1 (or -2, -3, etc.), which allows bypassing some SLP cost estimations.
@giuseppecesarano108 (1 month ago)
@nullbox02 I don't think this is a cost-model problem: when that happens, LLVM explicitly states that, according to its internal cost model, vectorization wasn't beneficial. In the cases reported above, however, LLVM says vectorization was "impossible with the available vectorization factors". This is further supported by the fact that even if I follow your suggestion and adjust the slp-threshold, the compiler still doesn't vectorize the two operators in the slower version.
@nullbox02 (1 month ago)
@@giuseppecesarano108 The "slower" version is more vectorized than the "faster" version. The "faster" code remains scalar, exactly because of the i64 index type, which prevents building SSE vectors (only 2 elements fit into a 128-bit SSE2 vector register, while i8 allows building 16-element vectors). So, generally speaking, the "slow" version may work faster on newer hardware. If you want to compare performance more correctly, try "-march=native", which allows better scheduling/cost estimation for your hardware.
@kiseitai2 (1 month ago)
There is a recent Primeagen video with Casey Muratori where they go over why some benchmarks are stupid. One takeaway is what you found: the integer division generated by modulo is unvectorizable. If you roll your own modulo using double casts (~3 lines), the compiler realizes it can be vectorized; they showed a 3x improvement for their case, roughly equal to the number of items they could vectorize per loop pass.
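A minimal sketch of that double-cast idea (my reconstruction, not the code from that video; floor_mod_via_double is a hypothetical helper and assumes the values fit exactly in a double):

    #include <cmath>
    #include <cstdint>

    // Replaces the integer division hidden in % with a floating-point
    // divide/floor/multiply, which the vectorizer can handle.
    inline std::int32_t floor_mod_via_double(std::int32_t dividend, std::int32_t divisor) {
        const double q = std::floor(static_cast<double>(dividend) / divisor);
        return dividend - static_cast<std::int32_t>(q) * divisor;
    }

    // floor_mod_via_double(-1, 10) == 9, matching the episode's floor_modulo.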
@Cybot_2419 (1 month ago)
One issue I see is the use of the indices array. For each board value, it stores a Point with its location. Assuming 32-bit indexing, that means the indices array takes up 8 times as much data as the game board itself. Since it is absolutely trivial to recompute, it is better to do so and reduce memory bandwidth and cache footprint. Replacing std::transform with two nested for loops gives me a 2x speedup with both compilers.
@WizardIke (1 month ago)
The nested loops let the compiler hoist half the modulus operations from the inner loop to the outer loop. Even after the optimizations from the divisor being known at compile time, that is a very expensive operation compared to everything else in the loop.
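A hedged sketch of that restructuring (width, height, and floor_modulo come from the episode; next_board and step_cell are hypothetical stand-ins for the real loop body):

    // The y-dependent modulus is computed once per row instead of once per
    // cell, and no indices array is materialized at all.
    for (std::int64_t y = 0; y < height; ++y) {
        const std::int64_t row_base = floor_modulo(y, height) * width; // hoisted
        for (std::int64_t x = 0; x < width; ++x) {
            next_board[row_base + floor_modulo(x, width)] = step_cell(x, y);
        }
    }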
@kensmith5694 (1 month ago)
A few thoughts:
1) Cache is quite big. If you fit within the cache, making things smaller often has no effect.
2) Which cache pages you dirty as you go can matter to throughput.
3) Compilers will often group variables based on size to keep them within word boundaries.
4) A compiler can work out that it need only bother with the bottom byte/word of a uint65536 and save a lot of cycles.
5) If you always modify Y when you modify X, the compiler may place them side by side so one memory access loads/stores both.
6) How hard a compiler works on constant folding and repeated expressions at different -O levels can vary.
7) On many machines you gain some speed if your loop's address is a multiple of some 2^N.
@pierregiraut8299 (1 month ago)
The issue appears to be related to false sharing. When smaller indices are used, multiple threads are likely to access data within the same cache lines, causing frequent cache-line invalidations. This negates the benefits of parallel execution policies. To validate this hypothesis, analyze cache-miss statistics. A potential solution is to introduce additional padding between data elements to ensure that threads access independent cache lines. For optimal performance, consider manually partitioning and distributing compute tasks across a thread-pool executor, ensuring that each task operates exclusively on unshared data. The performance disparity between Clang and GCC likely stems from differences in how the compilers lay out memory and distribute the compute tasks. Paradoxically, you may be better off with less compact data, to avoid false sharing :)
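A minimal illustration of the padding idea (my sketch, not the episode's code; assumes 64-byte cache lines):

    #include <cstdint>

    // Each cell gets its own cache line, so two threads writing adjacent
    // cells never invalidate each other's line. The cost: 64x the memory.
    struct alignas(64) PaddedCell {
        std::uint8_t value;
    };
    static_assert(sizeof(PaddedCell) == 64, "one cell per cache line");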
@sergiymykhaylyk (1 month ago)
Thank you for the video and analysis. Yes, you are absolutely right for your hardware, your test's execution flow, and your data set. With different hardware, execution flow, or data you might get different results... Conclusion: test your data and common scenarios, and choose the compiler and options based on your results.
@cuda_weekly (1 month ago)
Working with bytes on x86 has issues. A byte register overlaps a 64-bit register, but modifying the byte does NOT modify the upper (64-8=) 56 bits. This causes stalls in the CPU pipeline, because the CPU must wait for the byte result to be computed before it can do anything with the full register (this is called a partial-register stall). 32-bit operations, on the other hand, zero out the top 32 bits, which makes them just as fast as 64-bit operations. This matters if you do multiplications and such, where an 8-bit x 8-bit multiply gets promoted to 16-bit (another no-no). The compiler knows that and injects code to avoid having to do 8-bit operations, hence the lookup tables. If you get your indices to 32-bit you will get (slightly) faster code than with 64-bit. Obviously all the extra code trashes your cache as well.
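A two-instruction illustration of that difference (generic x86-64 AT&T syntax, in the spirit of the assembly experiment above; not from the video):

    add  %al, %bl     # 8-bit write: bits 8..63 of %rbx keep their old value,
                      # so the result depends on the previous full register
    mov  %eax, %ebx   # 32-bit write: bits 32..63 of %rbx are zeroed, breaking
                      # the dependency; as cheap as a full 64-bit move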
@musik8000 (1 month ago)
Instead of int#_t, is it better to use int_fast#_t so that the compiler can decide to use a wider type? With gcc 14.2.1 and clang 19.1.5 on x86_64, sizeof(int_fast8_t) is 1, but sizeof(int_fast16_t) and sizeof(int_fast32_t) are both 8.
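A trivial way to check those sizes on your own toolchain:

    #include <cstdint>
    #include <cstdio>

    int main() {
        std::printf("fast8=%zu fast16=%zu fast32=%zu\n",
                    sizeof(std::int_fast8_t),
                    sizeof(std::int_fast16_t),
                    sizeof(std::int_fast32_t)); // prints 1 8 8 on the setups above
    }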
@amirh6712 (1 month ago)
It would have been nice to compare the LLVM bitcode that Clang generated for the two versions, just to see if there is something weird going on between Clang and LLVM. If I had to guess, I would say that Clang is emitting some alignment attribute for the smaller field sizes that makes LLVM go crazy and in turn generate unnecessary memory-access ops.
@_ArtemB_ (1 month ago)
It may have something to do with clang/LLVM upcasting indices to match the size of the pointer when it needs to index into the array, while the loop-variable calculations are still done using the loop variable's type. This can add some overhead in tight loops and may sometimes affect loop unrolling if the loop is just around the unroll threshold.
@Y2B123 (1 month ago)
I had a similar experience recently with gcc. I changed a 16-bit integer type to 32-bit and immediately gained a 2x performance improvement. I didn't think too much of it because it was among a series of refactors; I assumed it was due to some weird optimization issue with integer promotion or something.
@anon_y_mousse (1 month ago)
Okay, I see a few problems with your code. One possible area of squiffiness is the Point struct: two 8-bit values could cause some weird access code to be generated at specific optimization levels, depending on packing and how the compiler understands the padding, but I'd only look there second. The fact that the game boards are allowed to be sizes that aren't a power of two really screws with things, because you're performing divisions with each and every index operation just for wraparound bounds checking. If you used a power of two, you could just do a bitwise AND, and the bounds checking would be essentially free. However, the problem isn't just that division is slow in general, but that division on 8-bit values causes weird code to be generated, so I'd look there first. If you really insist on using modulo to bounds-check and allowing oddball board sizes, then I would calculate the reciprocal of the divisor during setup for each board and use multiplication instead.
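A sketch of the power-of-two trick being suggested (my illustration; the episode's boards are not restricted this way):

    #include <cstddef>

    // Valid only when size is a power of two: i % size == i & (size - 1),
    // so the wraparound costs a single AND instead of a division.
    constexpr std::size_t wrap(std::size_t i, std::size_t size_pow2) {
        return i & (size_pow2 - 1);
    }

    static_assert(wrap(10, 8) == 2); // same as 10 % 8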
@musik8000 (1 month ago)
x86_64 CPUs are optimized for 64-bit operations. Even though 64-bit operations use more bits, the CPU might have more transistors per bit dedicated to them. 16-bit and some 32-bit operations require an operand-size override prefix. Using a partial register limits what the compiler can do with the remainder of the register, and storing multiple variables in one register can make it harder for the CPU to guess how to pipeline instructions, so using partial registers sometimes doesn't gain anything.
@johnmph7562 (1 month ago)
Maybe an alignment problem: if you have a processor with 64-bit registers and your struct uses two 16-bit values, it needs to do some masking and shifting to match the 32/64-bit boundaries.
@WizardIke (1 month ago)
When you call `floor_modulo`, `dividend` is a signed type (e.g. std::int8_t or std::int64_t) and `divisor` has type std::size_t, which means `dividend` will be converted to std::size_t during the modulo operation. This will likely give the wrong result when you pass -1 as `dividend`. EDIT: this only happens for the fast code; the slow code uses a signed type for the `divisor` too. Clang seems to have optimized `((dividend % divisor) + divisor) % divisor` into `dividend % divisor` for unsigned types, as they are the same when Clang can prove that the add doesn't overflow. EDIT 2: the slow code sizes its type to hold width, but `floor_modulo` stores up to 2 * width - 1 in the largest of it or int, which would be a problem for widths very close to the max value of int.
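A tiny demonstration of the conversion hazard described above (my example, with the divisor as std::size_t to match the fast path):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::int64_t dividend = -1;
        std::size_t  divisor  = 10;
        // dividend converts to size_t (18446744073709551615) before %,
        // so this prints 5, not the mathematically expected floor-mod of 9:
        std::printf("%zu\n", ((dividend % divisor) + divisor) % divisor);
    }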
@cxzuk (1 month ago)
Not sure if you've done a video on this subject yet; if not, it might be a good time to do one on -Rpass=.*, -Rpass-missed=.*, etc. (for Clang) and -fopt-info etc. for GCC. You can use these flags in godbolt and get underlined help tips. 10000 for width and height fits into an int16 and is being vectorized.
@arkadijsslobodkins8149 (21 days ago)
In general, clang has fewer bugs, generates better warning and error messages, and conforms more closely to the C++ standard. For this particular integer problem, it turns out clang has room to improve. But by and large, clang has caught up with GCC's quality of optimizations. The only reason I sometimes prefer GCC over clang is that it has some useful extensions and offers more recent features from C++23 and C++26.
@japedr (1 month ago)
9:18 It can actually even be consteval (and I would suggest it), given that everything in there is known at compile time.
@Hartor (1 month ago)
To me it looks like an alignment problem. I just checked the run_board() version, and it generates so many more instructions. At -O2 the min_int_t version is closer to the int64_t version at -O3. Aligning the members of Point to 8 bytes improves things a bit, but not completely. I have no experience with x86, but from my ARM Cortex-M7 days, misaligned access is really inefficient.
@treyquattro (1 month ago)
Excellent question. I've reached that conclusion too, but in my case it's really based on outdated information: GCC works, so why change it... My immediate thought (now that I've actually watched to the end) is that the processor really prefers naturally aligned data: you pay a not-insignificant premium for non-aligned or byte-aligned data. That said, you did mention keeping as much data in cache as possible (question: are you actually achieving the desired caching?), and I don't know whether the unaligned-access penalty is paid, or at least costs as much, with data already in cache. I'd like to know what happens with aligned 32-bit quantities, since that's the natural integer size for contemporary Intel architectures. I would expect the same, if not slightly (maybe un-measurably) greater, perf gains.
@Ratstail91 (1 month ago)
I'm working on a scripting language, and I've found that having the data aligned to the 32-bit word size gives an immense improvement in speed & correctness. I suspect the 64-bit code you generated has something to do with either the word alignment or the register sizes: if the data types are only 8 bits, the 64-bit registers need to do extra work to handle them.
@cristian-si1gb (1 month ago)
Weren't arithmetic types smaller than int (or std::int_fast32_t) known to be significantly slower and harder for the compiler to optimize around?
@Y2B123 (1 month ago)
Could you elaborate? I encountered this exact issue once on an older gcc: uint16_t -> uint32_t alone made a 2x performance improvement. I was as shocked as Jason, since smaller integers should improve caching. I just assumed that integer-promotion rules somehow prevented some optimization, but I didn't investigate the mess and moved on after the code ran fast enough.
@cristian-si1gb (1 month ago)
@@Y2B123 The native word size on most CPUs is 32 or 64 bits, meaning they can perform basic arithmetic operations only on registers of this size. So doing arithmetic on a uint16_t would, in the most naive implementation, require a masked read of the value into another register, performing the desired operation, and then a masked copy of the result back. Even though the compiler will usually optimize this, the resulting assembly will never be optimal.
@Y2B123 (1 month ago)
@@cristian-si1gb Thanks for the reply. I am not familiar with computer architecture and forgot about memory addressability. Still, it is unbelievable that one or two additional instructions per item can sometimes dwarf memory-loading delays. Perhaps there wasn't much locality to begin with.
@yashenkin (1 month ago)
To get more relevant results and say anything for sure about the differences between g++ and clang++, you would also need lots of source code (different programs from different authors for different purposes: emulators, compilers, games, scientific software, software for the Large Hadron Collider, space/aircraft software, banking/stock-trading software, operating systems written in C++, etc.) compiled under all the options... which is, no doubt, difficult to do.
@toby9999 (1 month ago)
Not really.
@taw3e8 (1 month ago)
So is the next video about int_fast? ;)
@tjthill (1 month ago)
Top of my initial-suspects list: whether clang decided/managed to prove there's no aliasing going on. The index references are to some one-byte, hence `char`, type, and so they have to be presumed to alias everything.
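A sketch of that aliasing concern (my example, not the episode's code): std::int8_t is normally a typedef of signed char, and character types may alias any object, which blocks some optimizations.

    #include <cstdint>

    // The compiler must assume the store to cells[i] may modify *width,
    // so *width is re-loaded on every iteration.
    void smear(std::int8_t* cells, const std::int8_t* width, int n) {
        for (int i = 0; i < n; ++i)
            cells[i] = static_cast<std::int8_t>(i % *width);
    }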
@heavymetalmixer91 (1 month ago)
Given that this could be a bug and you're using Clang 18.1.0, why not try a newer version? Right now you can use 19.1.6, and Godbolt has 19.1.0 available.
@cppweekly (19 days ago)
I tried multiple versions, the results were largely unchanged.
@Sonnentau1 (1 month ago)
Whaaaa? I get homework as a Christmas present. Best. Christmas. Ever.
@LewisCowles (1 month ago)
Try aligning to 64 bits overall when shrinking to 8 bits, by creating a dummy padding variable; it should be even faster, as there would be one variable with bit masking. If that does not work, I'd check whether you need to convert from 8-bit to 64-bit n times somewhere else (which could account for the difference). The int size has to stay the same for all iterations or you're not gaining anything in cache.
@troyfrei2962 (20 days ago)
Have you tried the LLVM compiler?
@fcolecumberri (1 month ago)
I wonder if -march would have a different impact on ARM/RISC-V machines, since IMHO those tend to be a bit more diverse in their features, while x86 is very much a settled arch without that many feature differences among CPUs. (I might be wrong; I am just wondering.)
@sinom (1 month ago)
On x86, -march can be used to automatically enable SSE, AVX, AVX-512, etc., which is great for highly parallelizable code but not as useful for most sequential things.
@mytech6779 (1 month ago)
It all depends on which optional extensions could actually benefit the task versus the generic one-size-fits-all instructions (and maybe, to a much lesser extent, the relative quality of the hardware implementations on a per-module basis, i.e. if the scalar ALUs are amazing but the SIMD unit is a terrible kludge). So there could be a major benefit for one particular combination of program and hardware while another sees insignificant changes.
@iuiui-iuiui (1 month ago)
Do you have checks/tests for your game-of-life results? If you don't check the results and you have removed the std::cout, clang tends to just kill lots of code as dead code, and then it is super fast 🙂 Also, if you trigger undefined behavior, clang can do very weird stuff, including removing even more code.
@shardator (1 month ago)

    std::optional<int> fn(int x, bool b) {
        std::optional<int> res;
        if (b) {
            res.emplace(x);
        }
        return res;
    }

Now compile with gcc and clang using -O3, and see the crap gcc generates.
@paradox8425 (1 month ago)
It sounds like, for some reason, they made optimizations for 64-bit types, assuming they would be used a lot, and didn't do the same for smaller types. Could be related to using them as indexes.
@mr_waffles_the_dog (1 month ago)
You get the bulk of the performance back just by switching GameBoard::width and GameBoard::height to size_t, though why the presumed pessimization occurs is beyond me. That takes perf on my machine from 40s to 23s. Switching the storage type to int64_t takes it down to 15s, but absent a gcc comparison I don't know how fair that is.
@mr_waffles_the_dog (1 month ago)
Yeah, so tracking through this: the core perf improvement in clang's codegen is not related to the field type; it's the behavior of `floor_modulo`. When you made the "help clang" changes, that changed the deduced type of the divisor parameter in `floor_modulo` to `size_t`. As far as I can make out, just forcing floor_modulo to have an unsigned divisor gets pretty much all of the perf fix in clang. This does not, of course, answer the question of why clang falls off a cliff here, and given that this means you have a signed dividend and an unsigned divisor, you run straight into the misery of sign mixing, and I'm not sure that's something anyone really wants to reason through.
@pyajudeme9245 (1 month ago)
Awesome video! Please make more videos about this topic!
@botsjeh (1 month ago)
I didn't see a commit with the type hardcoded as int8_t or int32_t. I'm pretty sure clang is just confused by the constexpr min() function and misses the point that the type is an integer, or something like that.
@codures (1 month ago)
Let's see:
RAX is 64 bits,
EAX is the low 32,
AX is the low 16.
But: AX also splits into AH and AL.
@RichardEricCollins (1 month ago)
Your code where you try to pick the best index size is an example of a coder trying to optimize code instead of leaving it to the compiler. It's been decades since we had to do this. Trust your compiler and stop trying to do its work. As for why the small index was slower: it's the "read claim" engines in the cache circuits. I dug deep into this subject twenty years ago for a console. The tl;dr is that packing data can be very bad for cache.
@PermanentWTF (1 month ago)
Am I the only one who wants to see ICC in the comparison? There's not much data about it.
@gast128 (1 month ago)
Perhaps use a profiler to find the hotspots for further study and comparison.
@khatdubell (16 days ago)
"Contrary to what you might have been told..." Yeah, but you're probably writing quality code. If you try to compile code that does weird/UB stuff, it's probably true that -O3 produces worse results.
@bozhidar.varbanov (1 month ago)
I have noticed that clang-18 is faster than clang-19.
@HenrikFrejasFar (1 month ago)
GCC (cc and c++) does produce better code in some (most) circumstances, but LLVM (clang and clang++) can live in parallel on a system, and based on your problem space or application, either GCC or LLVM may be the right choice (you can even use both in the same project, because they both produce *.o files, etc.). The thing about LLVM is that you can produce intermediate stages of the code, which enables you to create your own stages. This makes LLVM better for creating novel languages for domain-specific use, etc., because you only have to produce output for a middle stage and can use the rest of LLVM to optimize and produce the final binary. It is neither one nor the other: they are two different tools in your toolbox. Get to know your tools properly before critiquing one over the other... I do not screw in Phillips-head screws with my hammer.
@toby9999 (1 month ago)
I use MSVC and clang/LLVM side by side, with build configurations for each. So far, MSVC always wins on execution speed by around 10%. I have never used GCC; it's just too much hassle to set up on Windows.
@HenrikFrejasFar (1 month ago)
@@toby9999 Well, that is based on your workflow, and yes, many compilers win on execution speed for most tasks, but that is not what LLVM is best at. LLVM is not just a C or C++ compiler, and for general tasks it's not the best option, but if you work on language implementations it gives you a better (easier) environment to implement in, because it is modular: you can just write your language as a frontend and have LLVM handle the backend, even JIT. THAT is what LLVM is for.
@manda3dprojects966 (1 month ago)
The problem is that GCC's compilation is about 3 times slower, even if its output is a lot faster than that of many other compilers like MSVC in terms of performance.
@toby9999 (1 month ago)
I wouldn't expect GCC to be much better than MSVC in terms of code performance. I've been benchmarking MSVC vs Clang, and MSVC beats Clang every time, sometimes by 2x. I've never used GCC, but given what I said above, I don't see GCC beating MSVC by much.
@manda3dprojects966 (1 month ago)
@toby9999 Maybe you're right about GCC not being more performant than MSVC, but in debug mode MSVC is very, very slow, while GCC is fast (Clang is slow in debug mode too). Since GCC's compilation time is too slow, I don't use it; anyway, release mode is what matters most.
@toby9999 (1 month ago)
@@manda3dprojects966 I'll agree with MSVC being slow in debug mode. I wish I could get better performance from Clang with Visual Studio; I've never figured it out. It's always slow, contrary to what most people claim. My comment regarding GCC specifically was based on benchmarks and the relative performance of GCC vs. Clang vs. MSVC.
@vladimir0rus (1 month ago)
This is how "zero-cost abstractions" look in real life, haha =))
@mvuksano (1 month ago)
I suspect this may have to do with how the CPU accesses memory. On a lot of CPUs, memory access is much faster if the address is properly aligned.
@literallynull (1 month ago)
GCC vs Clang is a fight between two giants. Meanwhile, MSVC can't optimize basic stuff or vectorize code.
@LiviusHU (1 month ago)
What about -Ofast?
@cppweekly (1 month ago)
That only affects floating point computations, so it shouldn't make a difference...
@Voy2378 (1 month ago)
If -O3 bloats binary size, as the conference talk you quoted claims, then obviously it doesn't matter for tiny programs like yours, so this doesn't disprove anything.
@mytech6779 (1 month ago)
It's probably not that int64_t is particularly fast; more likely there is something about the implementation of that 8-bit type that is clogging the works. Especially considering that you were using a fairly new consumer-grade GPU for acceleration, and most of those don't implement native 64-bit in hardware, so emulated 64-bit runs at 1/10th the speed of 32-bit rather than half. (To get double precision in GPU hardware, you either need an nvidia card with an xx100 GPU chip, which runs around $20k used, or some of the ~10-year-old AMD FirePro workstation cards, some of which had very janky compute drivers, especially GCN 1 and 1.1, so good luck.)
@danielrhouck (1 month ago)
Not at a real computer now, so I can't look, but my thought is to keep turning off optimizations (because -O0 is no more literally zero than -Wall is all) and see if one of them is the culprit.
@stephenhowe4107 (1 month ago)
Is this release mode? I assume so.
@DodoLP (1 month ago)
When you wrote "GCC > clang" it seemed like it was the other way around: the bigger number is the time.
@xaxi5 (1 month ago)
Just assume the ">" means "better" to resolve the confusion.
@DodoLP (1 month ago)
​@@xaxi5 assuming ambiguity is a good way to fuck things up ;)
@Lion_McLionhead (1 month ago)
Too bad every optimization nowadays requires an entirely new compiler or an entirely new language. It's all about driving engagement.
@LeaoMartelo (1 month ago)
custom title
@leshommesdupilly (1 month ago)
Compilers are magic lmao
@panjak323 (1 month ago)
Ever heard of formatting tables? Why not use a common base (seconds) with up to 2 decimal places? The table is unreadable, and hardly anyone can tell what you are describing.
@botsjeh (1 month ago)
Mostly, only the headers were aligned wrong.
@rahantr1 (1 month ago)
because it doesn't ship with clang-format.
@neonmidnight6264 (1 month ago)
It is absolutely not. GCC is far behind Clang/LLVM in key areas (sometimes it is even behind .NET's ILC/RyuJIT).
@ChrisM541 (1 month ago)
Ok. So explain what is going on in this video? ;)
@toby9999 (1 month ago)
That doesn't seem right. I thought GCC generated the fastest code? Even MSVC beats clang/LLVM, sometimes by a lot.
@JonDisnard (1 month ago)
You're trying to be smart with type sizing.
@raymundhofmann7661 (1 month ago)
No worries, AI will fix all bugs in the future, especially these kinds of bugs.