Mathieu Ropert “This Videogame Programmer Used the STL and You Will Never Guess What Happened Next”

71,631 views

CppCon

4 years ago

CppCon.org
-
Discussion & Comments: / cpp
-
Presentation Slides, PDFs, Source Code and other presenter materials are available at: github.com/CppCon/CppCon2019
-
The STL is sometimes seen as a strange and dangerous beast, especially in the game development industry.
There is talk about performance concerns, strange behaviours, interminable compilations and weird decisions by a mysterious "committee".
Is there any truth to it? Is it all a misconception?
I have been using the STL in a production videogame that is mostly CPU bound and in this talk we will unveil the truth behind the rumours.
We will start with a discussion of the most common criticism against the STL and its idioms made by the gamedev community.
Then we will see a few practical examples with STL containers, explaining where they can do the job, where they might be lacking and what alternatives can be used.
Finally we will conclude with some ideas on how we can improve both the STL for game developers and also how to foster better discussion on the topic in the future.
At the end of this talk, attendees should have a solid understanding of why the STL is sometimes frowned upon, when it makes sense to look for alternatives to the standard and most importantly when it does not.
-
Mathieu Ropert
Paradox Development Studio
Experienced Programmer
Stockholm, Sweden
French C++ expert working on (somewhat) historical video games. Decided to upgrade his compiler once and has been blogging about build systems ever since. Past speaker at CppCon, Meeting C++ and ACCU. Used to run the Paris C++ User Group. Currently lives in Sweden.
-
Videos Filmed & Edited by Bash Films: www.BashFilms.com
*-----*
Register Now For CppCon 2022: cppcon.org/registration/
*-----*

Comments: 73
@ABaumstumpf
@ABaumstumpf 4 years ago
Well - it just IS true that you have better CONTROL over performance - but in no way does that mean having better performance. It just means that if you know what you are doing - the software, the compiler, the flags and the target platform - you can custom-fit your code to that. It also means that you limit the compiler's ability to optimise the code, while most compilers do a great job at that. Seriously, I have seen compilers optimise away the loop variable and 2 important flags that I added because I didn't find a way to do the thing without them. I was really surprised when I wanted to debug a segfault only to find that the most important bits of data (at least in my opinion) were not there. On the other hand, I have had instances where I was able to get very extreme speedups (like factor >20) for some very compute-heavy tight loops, making them stay nearly entirely in L1 with a lot of stuff in-register.
@AndreaGriffini
@AndreaGriffini 4 years ago
I remember seeing a slide about allocating a struct with two ints with malloc and using it without calling a placement new being UB, but I cannot find it any more. Was it edited away?
@LunaLikesSpace
@LunaLikesSpace 4 years ago
That was another talk. Around here: kzbin.info/www/bejne/laLdfqOhYpdlmcU
@saulaxel
@saulaxel 3 years ago
Does anyone know where to find the world map of the STL in high quality? I can only find blurry ones.
@Astfresser
@Astfresser 1 year ago
There was a talk; you'll find it in the corresponding GitHub repo.
@jopa19991
@jopa19991 4 years ago
So he shows how the -O0 raw implementation outperforms the STL variant, then a bit later he shows how it loses to that when compiled with -Og. I suppose we need to see the truth that lies behind -O2...
@Kalumbatsch
@Kalumbatsch 4 years ago
On my machine at least, STL accumulate on a recent gcc using -O2 or -O3 is 20% faster on huge arrays (2G or so) and about 90% faster on smaller ones. Even with -O0 it's only about 5% slower. I call bullshit on this whole talk, don't know how he came up with those numbers.
@blakebaird119
@blakebaird119 4 years ago
He shows that the raw C accumulator is much faster than **un-optimized** STL, and that enabling any optimization level makes the STL accumulator much faster. What he doesn't show is a slide with optimized numbers for the raw accumulator vs the STL. He's just missing that slide, but implies there isn't much of a difference.
@zvxcvxcz
@zvxcvxcz 4 years ago
@Simon Farre The thing that he's ignoring is that C is easier to optimize. The compilers do not end up doing an equally good job. Sure, maybe one day, but for now, when you write simple C loops, that is nice and easy for the C compiler to vectorize. That gets harder and harder to interpret correctly as you write more complex C++. It's not even that the C compiler is better, it has an easier job because you wrote it the lower level way. You can write it and use the C++ compiler for both and the lower level version/simpler version for the compiler to understand will usually be better optimized.
@Astfresser
@Astfresser 1 year ago
I figure both accumulators at -O2 will produce the exact same assembly, so any difference will be noise.
@masondeross
@masondeross 11 months ago
@@Kalumbatsch That part of the presentation was to show that even raw C is slow when compiled without optimizations, and even minimal -Og STL is as fast as -O2 raw C accumulation. It was showing exactly what you are faulting him for not showing, that the STL is competitive with raw C. He didn't go on to show it is superior, though, by showing the STL optimized, since it seems he didn't realize people would think he was showing how good/bad the STL is when he was just showing that C is "also bad" when not optimized.
@georganatoly6646
@georganatoly6646 4 months ago
was not aware of the realloc constraints on map/set
@bernadettetreual
@bernadettetreual 4 years ago
Ah well, it's the STL algorithms that come in handy, not the containers. Just make your own containers, loosely follow the STL interface, reuse the algorithms and you're good. I can't remember how often I've stopped people at work from using std::function, since we have a tailored version.
@litmus3742
@litmus3742 4 years ago
Local std::functions or a custom version are just silly when you've got auto. Something at class scope that strips out type erasure is fair enough though.
@zvxcvxcz
@zvxcvxcz 4 years ago
You must do a different kind of coding than I do. I write mostly scientific code... so it's mostly about implementing a new algorithm and they don't really intersect with the STL algorithms very often.
@MagnificentImbecil
@MagnificentImbecil 2 years ago
@Lorash: As per the suggestion in this video: is it possible to publicly present/document the disadvantages of the STL and/or compare them to the advantages of these other libraries you mention (e.g. regarding interface, performance)? E.g. via a SourceForge/GitHub/Bitbucket/GitLab/Gitea/... repo. Or Twitter. Or a Google Doc. Or a blog. Somewhere where people can publicly comment/analyze. Otherwise, we are back to the Monkey Tale, where one person spreads myths and others are expected to follow the mythology (facts they have never witnessed, only heard about - from STL haters, too).
@omokokkoro2292
@omokokkoro2292 4 years ago
When talking about std::array, he said that std::array has a fixed size (which is understandable), but not a fixed capacity. What does he mean by that?
@ashtum
@ashtum 4 years ago
With std::array all objects must be constructed at the beginning, because it has no size member with which it could track how many valid objects are in the allocated space. I think he wants something like boost::static_vector.
@jeremydavis3631
@jeremydavis3631 4 years ago
He was saying that it has a fixed size rather than only having a fixed capacity. IIRC, his words were "fixed-size, not fixed-capacity", but it's impossible to have a variable capacity with a fixed size. His point was that it would be useful to have a fixed capacity and variable size, like in the alternatives he showed next, so that one could do things like push_back without the risk of unplanned allocations.
@williamkmanire
@williamkmanire 4 years ago
It means that all of the memory needed to store N copies of the element type is allocated immediately. He suggests that you should be able to specify a maximum size for the array but not necessarily allocate the memory until it is needed. In practice, I don't understand the difference, because it seems to me that the compiler would have to reserve the specified capacity whether or not it is used, making that allocated memory unavailable for other purposes. Hopefully he'll see this comment and clarify the utility of this.
@ian3084
@ian3084 4 years ago
He would like to keep track of the used elements, like being able to use push_back, pop_back etc. I guess the idea is to keep the interface similar, so you know how many items you have there (like in vector), but still have only one static allocation. That of course would make it a different thing than what std::array is. There are ready-made implementations of such things, like he said.
@MagnificentImbecil
@MagnificentImbecil 2 years ago
Also, there is a difference between allocating (up to v.capacity () elements) and constructing (up to v.size () elements). The extra step for construction (of unused elements) might affect speed. Worse (IMO), it might affect the way the class of the elements is designed/written/used: Now we have to provide some sort of "unused" state. For example, this might result in offering a default constructor (where there had not been one before) and adding checks in (all) member functions... and in (all) classes for which objects might contain an element as a sub-object. :'( As the Presenter has said: "Messy...".
@smiley_1000
@smiley_1000 3 years ago
I wish this was more videogame-specific
@UrSoMeanBoss
@UrSoMeanBoss 2 years ago
Yeah, I was almost expecting to see apples-to-apples real-world comparisons between STL and non-STL implementations in the context of a game engine or something, to see how large of an impact it has, etc.
@aryanparekh9314
@aryanparekh9314 3 years ago
13:08 why on earth would you benchmark code in debug? That's like using a racecar with the handbrake on.
@MagnificentImbecil
@MagnificentImbecil 2 years ago
That is what some/many people do, and they use the results of those "benchmarks" to claim that C++ and/or enabling exceptions and/or the STL and/or Boost are slow!
@taragnor
@taragnor 2 years ago
Yeah honestly. Nobody cares if the debug version is slower. It's the optimized release that you care about.
@threepoundsofflax8438
@threepoundsofflax8438 2 years ago
It's important for video games where you want to be able to run the game in debug mode
@nishanth6403
@nishanth6403 2 years ago
@@threepoundsofflax8438 You mean for testing purposes ?
@user-nw8pp1cy8q
@user-nw8pp1cy8q 1 year ago
@@nishanth6403 Exactly. If you need to debug a game but it plays at 0.01 FPS and takes half an hour to load a saved game, it's almost impossible. For example, when I was making PRs to the open-source game Cataclysm: Bright Nights, which uses the STL, it was quite irritating whenever I needed to run the game in debug mode.
@pedromiguelareias
@pedromiguelareias 3 years ago
It's fixable. I mean, value-based programming, meager classes (rule of zero and serialization, that's all), variants for gui-like quantities, two custom containers (with small buffer optimization and all the goodies) and a digraph framework for relations between classes. Do it and call it C+-. It will save the language.
@Dolkarr
@Dolkarr 2 years ago
Implying it needs saving in the first place.
@abbas1552
@abbas1552 2 years ago
What digraph framework can you recommend?
@pedromiguelareias
@pedromiguelareias 2 years ago
@@abbas1552 Boost, but I rolled my own.
@pedromiguelareias
@pedromiguelareias 2 years ago
@@Dolkarr It doesn't, but as it is, it can scare new developers :-)
@AM-qx3bq
@AM-qx3bq 4 years ago
Why did he say "You guys know you shouldn't be using list right?" What's the apparently commonly known problem with list?
@Bozemoto
@Bozemoto 4 years ago
They're almost always slower. Inserting and removing at a randomly selected index is way faster on the vector, even though it has to move more memory, because linear traversal of memory is something computers are incredibly fast at. All the cache misses when traversing the list cost more. There should be talks about it.
@litmus3742
@litmus3742 4 years ago
A list will typically have each node new'd up separately. If the whole list is allocated in one go then it's fine. But if that list is added to over a long period with many allocations in between (i.e. over multiple frames in a video game), then the list is likely to be scattered in memory. This means you're potentially having to fetch memory from RAM on each iteration; on my old PC an L3 cache miss was about 25 times slower than a sqrt operation, or roughly equivalent to a trig function in terms of CPU cycles.
@atimholt
@atimholt 4 years ago
@@Bozemoto There are talks about it. Haven’t watched them in years, though. I think it was a Stroustrup talk.
@keris3920
@keris3920 4 years ago
Use an aligned pool allocator if you're going to use a list. For example, see parser libraries like rapidxml for an excellent demonstration of a doubly linked list that is more efficient than trying to expand a vector to insert data.
@joestevenson5568
@joestevenson5568 4 years ago
Lists are very slow compared to arrays in nearly every use case. They're basically cache-miss generators.
@frenchmarty7446
@frenchmarty7446 1 year ago
He lost me with the performance comparison. Why didn't he show the performance of code using the STL with optimizations enabled? Was it implied that the raw implementation and the STL style compile to the same assembly when you compile with optimizations? Was his point that the performance penalty is insignificant in comparison? Overall, he's not clear on what he's actually getting at throughout his talk...
@brandonlewis2599
@brandonlewis2599 4 years ago
Ugh. Can we fix modulo? If it's possible to work around it, can the compiler emit better code? This seems like low-hanging fruit.
@WutipongWongsakuldej
@WutipongWongsakuldej 4 years ago
AFAIK, on x86 modulo is implemented by a single DIV/IDIV instruction. This instruction, however, takes longer than other arithmetic instructions if I recall correctly (like... 5-10 times slower). It's probably possible to implement the modulo operation using other instructions, but the result might or might not be better than just using DIV/IDIV.
@sirmoonslosthismind
@sirmoonslosthismind 4 years ago
for a divisor that's a power of two, you can and should use a bitwise and instead of a modulo. assuming x and y are both unsigned: (x % y) == (x & (y - 1)) // only where y is a power of two
@suokkos
@suokkos 4 years ago
But modulo also has the bad habit that on many datasets you get more conflicts than with other methods. Conflict reduction is an important consideration for a generic implementation. I have seen recommendations that a hash table should use prime sizes to minimize conflicts. A dynamic size with a prime count requires that you use the hardware division operation, or do quite a lot of work on each resize to figure out an inverse to use for multiplication. Runtime inverse-multiplication selection is a fairly fast operation for most cases, but it uses a division operation and quite a lot of other operations, making it only beneficial if you repeatedly calculate the same division. Even bigger values can be implemented fairly efficiently. Obviously there is still the issue that calculating the quotient requires only one multiplication, while modulo requires a second multiplication and a subtraction in the best case. That would be unlikely to apply when the input is a hash function with a full word as its range of potential outputs. Of course you can be tricky and instead choose a multiplicative inverse which gives you the quotient directly in your size range. That requires the hash function to make sure the most significant bits are the most random bits. The already-linked Fibonacci hashing solves both: using the golden ratio naturally gives the fewest conflicts for most data sets, and it also allows turning a division into a multiplication.
@absurdengineering
@absurdengineering 4 years ago
No need to “fix” the compiler. For hashing, the modulo is by a constant, so you can precompute its reciprocal. Sure, the compiler could do it for you, but do they? It’s kinda trivial to just code it explicitly. In my experience multiplying by reciprocal and subtracting from the dividend beats DIV on all architectures. It takes a single cycle since it’s your standard signed MAC operation and those are at least this fast. You can use “wide” intrinsics and do several hash computations per cycle that way. Maybe even your compiler will vectorize this for you! That is, unless an architecture speculates the DIV constant divisor (whatever was used last time), retains the reciprocal in the data predictor (a generalization of the jump predictor), and speculatively multiplies by the predicted reciprocal. In fact, this is a good test for detecting such architectural detail: if DIV is as fast as MAC by precomputed reciprocal on a given platform, but not as fast otherwise, then you know that this sort of speculation happens :).
@xl000
@xl000 4 years ago
@Lorash Also look at how compilers compile multiplication by an integer constant. Example: f: x -> x*40. It will decompose it as x8 = x << 3; x32 = x8 << 2; res = x8 + x32, or something like that. Or x = ((x << 2) + x) << 3.
@AinurEru
@AinurEru 2 years ago
Oh boy, so many invalid arguments in one video:
1) The strongest complaints about C++ performance actually relate to the very 3 topics that were mentioned at the beginning of the talk and not covered at all.
- Allocators: the main issue with memory locality is that RAII pushes the STL's overall design in the direction of allocating way too frequently. The only way to mitigate the negative effects of that is with a very good allocator story interwoven throughout the design of all containers and types. Sadly, even with ::pmr the overall design of the STL is still very lacking there, being held back by its design origins.
- Exceptions: almost always the first thing to be completely switched off in games, and that tends to rule out most of the STL.
- Compile time: one of the biggest drags on development productivity, especially at scale, and the entire STL is heavily templated (it's in the name).
2) Specification vs. implementation: there are complaints about both, but most of the arguments being raised against the STL are in the design space, not the implementation. Suggesting otherwise, or that complaints aren't accounting for the distinction, is just incorrect and smells like a cop-out argument.
3) Associative containers: the choice of hash function is completely orthogonal to the choice between open addressing and the linked-list approach. It's illegitimate to use any comparison of hash functions as a way to justify that linked lists can be good enough. All the comparisons in the diagrams used linked-list approaches; they didn't compare against open addressing.
4) Suggesting that 99% of C++ developers have no use for faster associative containers or faster small dynamic arrays is more than a bit of a stretch.
5) Suggesting that a large percentage of C++ developers are impeded from using associative containers that are less memory efficient is also a stretch.
6) Nothing is preventing the committee from having additional associative containers and vector types with different trade-offs in the standard - and yet they don't.
@hesselkeegstra4286
@hesselkeegstra4286 3 years ago
He claims (@14:09) the 80486 was the last Intel CPU implementing an in-order execution model. That claim does not hold: the successor to the 486, the original P5 Pentium series, was also still an in-order execution CPU. Intel's first out-of-order execution CPU was the P6, introduced in late 1995.
@Spartan322
@Spartan322 3 years ago
Isn't that missing the point? His point was that (almost) nobody uses a CPU that relies on in-order execution, least not in the commercial video game market, so treating your code as if it does is nonsense.
@MagnificentImbecil
@MagnificentImbecil 2 years ago
In this regard, Intel Atom processors, at least in the initial versions, were a regression: they have in-order execution, just like before the Pentium Pro (i.e. the P6), which had out-of-order execution. It is useful to note that various kinds of instruction-level parallelism exist - not just out-of-order execution. (And even out-of-order execution has many versions, e.g. Tualatin vs Northwood vs Prescott vs Centrino/Core. That is where the battle among generations of Intel and AMD CPUs has been fought.) Even the original Intel 8086 had a separate Memory Management Unit. Regarding the 80486: it had cache integrated in the CPU, it had CMPXCHG, it had the FPU integrated in the CPU, and it averaged one instruction per cycle (more accurately: two instructions per two cycles). I hope I have remembered correctly. (-:
@Spartan322
@Spartan322 3 years ago
It would probably be more effective for game devs to try and contribute to the STL when it fails for their use case than to complain and just write it all by hand; at least if that happened, nobody would complain about the STL again a few years later. (Aside from what they mostly do it for now, that being to beat on the STL because it used to suck or was implemented poorly.)
@Weaseldog2001
@Weaseldog2001 3 years ago
Exactly. And it supports overloads for memory allocations, circumventing one of the main complaints. A programmer can plug in their own allocation system. The only problem I've come across, is when I had the need to use a less-than operator that was very expensive. I discovered that the binary search implementations in STL, make extra calls to this operator as a side effect. I found it trivial to fix the search, so that it made the minimal number of calls necessary to this operator, resulting in a noticeable performance improvement for my code.
@user-nw8pp1cy8q
@user-nw8pp1cy8q 3 years ago
@@kishirisu1268 I would argue that UE4 actually uses its own C++.
@oginer
@oginer 2 years ago
Not every tool needs to be good for every use case. So the STL is not the best for games? That's ok, and I don't see anyone complaining about it (remember, stating facts != complaining). You seem offended that someone says STL is not the best of the best for every use case.
@askeladden450
@askeladden450 2 years ago
@@Weaseldog2001 stl's allocator interface is probably one of the worst around. try using eastl allocators and you will never approach stl allocators again.
@devatsdb7
@devatsdb7 1 year ago
@@Weaseldog2001 They already do write their own... and then people complain, asking why game developers never use the STL??
@MrCOPYPASTE
@MrCOPYPASTE 2 years ago
There is a big misconception about STL usage in AAA games... STL isn't a problem... Crappy programmers are the ones that give C++ a bad name...
@__hannibaal__
@__hannibaal__ 1 year ago
Yeah, C++ was made by mathematicians, for programmers who think, have good knowledge, and understand modern mathematical concepts (logic, group theory, operational calculus, …, number theory (analytic number theory is also useful)).
@OREYG
@OREYG 1 year ago
This is not a misconception; the STL was unfit to be used as a main container library until recently, with the introduction of PMR. And even with PMR, we got overbloated abstractions for allocators, so you would never see adoption. Another problem is that while gamedev is a huge C++ community, it barely participates in the committee. So the STL (and the language) is shaped by web / high-frequency trading, and there is a disconnect with gamedev.