man, Chandler makes the camera man work really hard for his money xD
@kamilziemian995 1 year ago
😂
@kamilziemian995 1 year ago
Chandler Carruth's talks are great. I try to watch every one of them available on YT.
@barrettellisadair 8 years ago
I always enjoy Chandler's talks - this one has great video/audio quality as well. I would like to emphasize that this pointer-packing stuff is implementation-defined, and generally pointless outside of a performance bottleneck. So, some of the code you don't see in the slides likely contains a lot of platform-specific configuration macros (but I haven't looked at those parts of the Clang code base yet). TL;DW: cater to the cache, pack your pointers.
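The low-bit packing being discussed can be sketched in a few lines. This is a hypothetical `TaggedPtr`, not LLVM's actual PointerIntPair: it relies on a 4-byte-aligned object having its low 2 address bits zero, which is exactly the implementation-defined assumption the comment above warns about.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch (not LLVM's actual PointerIntPair): an object
// aligned to at least 1<<TagBits bytes has that many low address bits
// zero, so a small integer tag can be stashed there. Relying on this
// is implementation-defined, hence the platform checks in real code.
template <typename T, unsigned TagBits = 2>
class TaggedPtr {
  static_assert(alignof(T) >= (1u << TagBits),
                "not enough spare low bits for the tag");
  static constexpr std::uintptr_t TagMask =
      (std::uintptr_t(1) << TagBits) - 1;
  std::uintptr_t Bits;

public:
  TaggedPtr(T *P, unsigned Tag)
      : Bits(reinterpret_cast<std::uintptr_t>(P) | Tag) {
    assert((Tag & ~TagMask) == 0 && "tag does not fit in the spare bits");
  }
  T *pointer() const { return reinterpret_cast<T *>(Bits & ~TagMask); }
  unsigned tag() const { return unsigned(Bits & TagMask); }
};
```

One word of storage now carries both a pointer and a small enum-like tag, which is the cache-footprint win the talk is after.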
@MrAbrazildo 8 years ago
I suppose you are talking about the 4-bit packing stuff, rather than the array of unique pointers for large objects, right?
@10se1ucgo 8 years ago
Agreed, he's definitely my favorite presenter.
@kewqie 7 years ago
Chandler is my favorite speaker after Herb, good talk!
@seditt5146 5 years ago
I agree. Another speaker whose talks are all worth catching is Ansel Sermersheim; he makes hard-to-understand subjects highly consumable. Timur Doumler also has good talks, but he is not the best public speaker - not the worst either; he is very good, but not at Herb's or Chandler's level.
@Cons-Cat 2 years ago
I'm quite sure that the parameterized-allocators problem is easily solvable with C++20's `concept` feature now.
@ivanzhechev4215 1 year ago
Slowly learning C++20, but how would you achieve that? The only way I can think of is rejecting incompatible allocators and giving a compiler error, which doesn't really achieve the result of having the SmallVector on API boundaries.
@davidforgeas2235 8 years ago
Is there a significant performance benefit to using the hybrid 32-bit-pointers/64-bit-registers ABI, namely the x32 ABI? I don't mean for the generated code, but specifically for the compiler itself?
@SianaGearz 7 years ago
So you get an expanded register file with 8 extra registers, but register scarcity has not been a major issue since about Pentium Pro, since top of stack is actually aliased to an unknown number of hidden registers and the corresponding operations get removed at the opcode decoding stage. You'd get a slight bottleneck at opcode decoding but it was pretty much removed in the later iterations of Pentium III. There also doesn't seem to be any difference whatsoever in compiler performance between pure 32-bit and 64-bit ABIs, at least from my experience with GCC - don't know whether it extends to Clang, but I strongly suspect so.
@dat_21 6 years ago
Can you elaborate on "top of stack" aliasing? I don't find any mentions of it in intel optimization guide or Agner Fog's guides. Maybe you are talking about FP register stack?
@wintermute701 8 years ago
The talk from 2 years ago he mentions in the beginning: kzbin.info/www/bejne/nHmxnoWhr917jdU
@enhex 8 years ago
12:00 Can alias templates be used to avoid those mistakes?
@ChandlerCarruth 8 years ago
Yes, but at the expense of having to come up with a new name for each different small-size-optimized container in a function's API boundary. =/ Doesn't seem to scale very well.
@xfreeman86 6 years ago
You can use the same technique you use to remove the size parameter from the type of SmallVectorImpl. Allocators are passed by reference, so no slicing worries, and you can use the pimpl idiom for polymorphic copying. My jury is still out on whether it's better, but it is possible.
@Shockszzbyyous 7 years ago
OK, I see - so the idea behind SmallVector is that you can allocate some space beforehand, instead of allocating when pushing something onto the vector?
@tomcheng3903 7 years ago
No, the point is that you allocate it on the stack, not the heap. If it's too large for the stack, then it falls back to the heap. You can allocate space in a vector beforehand with vector::reserve().
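The stack-first, heap-fallback behavior described in this reply can be sketched as a minimal small-buffer vector. This hypothetical `TinyVector` is far simpler than LLVM's real SmallVector and assumes a default-constructible, copyable element type.

```cpp
#include <cassert>
#include <cstddef>

// Minimal small-size-optimization sketch (nothing like as complete as
// LLVM's SmallVector): storage starts in a buffer embedded in the
// object itself -- so typically on the stack -- and moves to the heap
// only when it overflows.
template <typename T, unsigned N>
class TinyVector {
  T Inline[N];                       // in-object storage, no heap yet
  T *Begin = Inline;
  std::size_t Size = 0, Cap = N;

  bool isSmall() const { return Begin == Inline; }

public:
  ~TinyVector() {
    if (!isSmall()) delete[] Begin;  // free only if we grew to the heap
  }
  std::size_t size() const { return Size; }
  T &operator[](std::size_t I) { return Begin[I]; }

  void push_back(const T &V) {
    if (Size == Cap) {               // current buffer full: fall back to the heap
      T *Fresh = new T[Cap * 2];
      for (std::size_t I = 0; I < Size; ++I) Fresh[I] = Begin[I];
      if (!isSmall()) delete[] Begin;
      Begin = Fresh;
      Cap *= 2;
    }
    Begin[Size++] = V;
  }
};
```

As long as the element count stays at or below N, no allocation ever happens, which is the common case the talk optimizes for.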
@NoNameAtAll2 3 years ago
black magic
@rubberduck2078 2 years ago
>llvm >Fast
@jonesconrad1 5 years ago
i just got lost on xkcd.com for 10 minutes
@soonts 8 years ago
@20:00 Microsoft has had a very similar implementation in their ATL since 2003. Their implementation is called CAtlPlex, and it is used internally by many ATL containers.
@BlairdBlaird 2 years ago
Apple also uses that at the language level (in both ObjC and Swift, IIRC). In fact, on 64-bit they use not only the low bits but also the high bits, because all extant 64-bit implementations only have a 48-bit address space at the moment (and x86-64 is architecturally limited to 52 bits of physical address).
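The high-bit variant mentioned here can be sketched as a hypothetical pack/unpack pair that assumes user pointers fit in the low 48 bits. That assumption already fails on CPUs with 57-bit virtual addresses (5-level paging), which is why such schemes are always gated per platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: if user-space addresses fit in 48 bits, the
// top 16 bits of a pointer-sized word are free for a tag. Even less
// portable than low-bit packing -- guard it per platform.
constexpr unsigned AddrBits = 48;
constexpr std::uintptr_t AddrMask = (std::uintptr_t(1) << AddrBits) - 1;

inline std::uintptr_t packHigh(void *P, std::uint16_t Tag) {
  std::uintptr_t Raw = reinterpret_cast<std::uintptr_t>(P);
  assert((Raw & ~AddrMask) == 0 && "pointer uses more than 48 bits");
  return Raw | (std::uintptr_t(Tag) << AddrBits);
}
inline void *unpackPtr(std::uintptr_t V) {
  return reinterpret_cast<void *>(V & AddrMask);
}
inline std::uint16_t unpackTag(std::uintptr_t V) {
  return std::uint16_t(V >> AddrBits);
}
```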
@onosendaiAAB 8 years ago
Hi Chandler, I may be misinterpreting you, but at kzbin.info/www/bejne/rHbPi5Zsr7h8jq8 are you implying that modern CPUs do indirect memory prefetching? e.g. are you implying that CPUs will work out you are iterating through an array of pointers, read one or more indices ahead in that array, and then prefetch the memory pointed to by that pointer?
@deadalnix 8 years ago
Modern Intel CPUs do it to some extent. I'm not aware of any other architecture doing it.
@onosendaiAAB 8 years ago
Interesting, do you have any further information or references about that?
@onosendaiAAB 8 years ago
I've just done some tests on my Ivy Bridge, and I'm pretty sure it doesn't do indirect memory prefetching.
@deadalnix 8 years ago
Note that prefetching on a pointer load is bound to be slow no matter what. If you want to measure whether the CPU does prefetching, use the same algorithm twice: one version doing prefetching explicitly using a compiler intrinsic, and the other letting the CPU do its thing. If you see no significant perf difference, your CPU does prefetching. If the one with explicit prefetching is faster, then your CPU doesn't do prefetching. I wouldn't be able to tell you if your specific CPU does it. Some Intel CPUs do, but I'm not sure which ones.
@noahgoldstein3951 2 years ago
@onosendaiAAB I think you're right there is no indirect prefetching. Possibly he means that the execution of later iteration loads can occur speculatively and in parallel to the first load because there is no dependency.
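The A/B measurement suggested in this thread might be sketched like this: sum through an array of pointers once plainly, and once issuing an explicit software prefetch a few iterations ahead via `__builtin_prefetch` (a GCC/Clang builtin). Timing is omitted; wrap each call in a clock and compare on a working set larger than cache.

```cpp
#include <cstddef>
#include <vector>

// Plain traversal: each iteration dereferences one pointer.
long sumPlain(const std::vector<int *> &Ptrs) {
  long S = 0;
  for (int *P : Ptrs) S += *P;
  return S;
}

// Same traversal, but explicitly prefetching the target of a load
// `Ahead` iterations in the future. If the CPU already does indirect
// prefetching, this should make no measurable difference.
long sumPrefetch(const std::vector<int *> &Ptrs, std::size_t Ahead = 8) {
  long S = 0;
  for (std::size_t I = 0; I < Ptrs.size(); ++I) {
    if (I + Ahead < Ptrs.size())
      __builtin_prefetch(Ptrs[I + Ahead]); // hint at a future iteration's load
    S += *Ptrs[I];
  }
  return S;
}
```

For the comparison to be meaningful, the pointers should land in a shuffled order so the hardware stride prefetcher cannot help on its own.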
@SolomonUcko 4 years ago
Is this the previous talk? kzbin.info/www/bejne/nHmxnoWhr917jdU ("Efficiency with Algorithms, Performance with Data Structures")
@MrAbrazildo 8 years ago
_Value.template_, I don't even know what this means.
@barrettellisadair 8 years ago
It is a disambiguation for accessing a templated member of an object that is itself a dependent type.
@xfreeman86 6 years ago
It's hard to parse as a human, and you just happened to get it wrong is all. Read it as (Value).(template is)(). It's for when you call a template member function with type parameters that cannot be deduced from its formal parameters.
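A minimal example of the disambiguation being described, with hypothetical `Holder`/`fetch` names:

```cpp
#include <cassert>

// Inside `fetch`, `Value` has a dependent type, so without the
// `template` keyword the `<` of `get<int>` would parse as less-than.
struct Holder {
  template <typename T> T get() const { return T(42); }
};

template <typename V>
int fetch(const V &Value) {
  return Value.template get<int>(); // tells the parser `get` is a template
}
```

Note the type argument `int` cannot be deduced from any function parameter, which is exactly the case described above.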
@NXTangl 5 years ago
@@xfreeman86 ...wait, you're Gordon Freeman's brother, aren't you?
@insideloop22 8 years ago
As usual, it is a joy to listen to Chandler's talks. I am working on an open-source library of data structures for high-performance computing, and one of the first things I had to do was rebuild a custom std::vector. One of the many reasons is that std::vector v(n); fills the vector with zeros, which trashes the cache and prevents you from using the "first touch" policy, which is extremely important for NUMA architectures such as dual-socket nodes. Alignment is also an issue for getting efficient vectorization in HPC. These things could be "solved" with allocators, but I just hate them, as I don't want them to change the type of my containers. I have a few questions though:
- Do you have an llvm::Vector, or do you use llvm::SmallVector as a replacement for std::vector? Do you also use llvm::Vector as a replacement for std::array?
- There is a BumpPtrAllocator, but it does not seem possible to use it with SmallVector. Do you have any design ideas for allocators? I just can't get them to work the way I want. I hate both the way they are done in the C++ standard library and the Bloomberg implementation.
@ChandlerCarruth 8 years ago
We use SmallVector pretty much everywhere. On very rare occasions we'll want what std::vector provides, but it's unusual. std::array is a totally different beast -- it is a statically sized array which has essentially no overlap with the vector use cases. I suspect there are some places where we could use std::array and currently use llvm::SmallVector, but I think they're rare.
@ChandlerCarruth 8 years ago
And regarding allocators -- for LLVM, the pattern we fall into more often is needing to control allocation of the *objects* with allocators, not the buffers in the data structures. For those buffers, tcmalloc, jemalloc, or any of the other optimized free-list based malloc libraries will do a fine job.
@henrikvallgren1943 8 years ago
I tried to get attention to this issue a while back: groups.google.com/a/isocpp.org/forum/#!msg/std-proposals/b946VongJjY/RYYj--v63qsJ
@insideloop22 8 years ago
There are too many things in the standard library that are broken for me, and I am glad I decided to write my own containers. It provides me so many useful things. As I work mainly on numerical simulation, all I really need is a good std::vector implementation. Having my own allows me to:
- Have an il::Vector v(n) that does not fill the vector with 0 (same with float, int, etc.)
- Have an il::Vector v(n) that gets initialized to NaN in debug mode (very useful if you want to track uninitialized values)
- Add my own constructors, such as il::Vector v(n, il::align, 64), if I want to align the buffer to a cache line
- Easily count the number of copy-constructor calls in il::Vector (if I want to make sure that move semantics are done properly)
- Have il::Vector::size() return a signed integer (I hate unsigned integers and all the bugs they lead to)
- Write an il::SmallVector as Chandler pointed out
- so many useful things...
I like the C++ language, but I just have so many troubles with the standard library. I am glad to hear that LLVM worked on their own containers. I can point people who say I am stupid not to use the standard library to this kind of talk.
@MrAbrazildo 8 years ago
So, did you already write your own, or just want to?
@TheEmT33 5 years ago
i love this guy
@tribalfromthetrapdontrappi3030 6 years ago
Nice!
@spudtaters8419 8 years ago
The SmallVector slide at 5:26 is confusing me because of omissions: 1) Can all small-size vectors be upgraded to heap storage on overflow? 2) Is SmallVectorImpl what you use for function params? I assume so, it just sounds like a weird / wordier name.
@insideloop22 8 years ago
1) Yes, with SmallVector v; they can call v.resize(10); 2) The name is a bit weird, but I don't know which one they could have chosen
@xfreeman86 6 years ago
For 2), the names should have been switched. Then you allocate SmallVectorImpl but pass around SmallVector.
@sebastianmestre8971 5 years ago
@@xfreeman86 what about allocating SmallVector and passing around BasicSmallVector
@JohnDlugosz 5 years ago
6:17 Slide 6 -- How does the SmallVectorImpl::push_back know whether to free the old buffer? Without knowing the original N, it can't tell if it's been reallocated to the heap already, or was still using the small buffer.
@pazdziochowaty 4 years ago
It can check whether Begin points right after the SmallVectorImpl object - that is where the derived class keeps the inline buffer.
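The check described in this reply can be sketched as follows. This is a simplified layout assumption, not LLVM's exact code (the real version lives in SmallVector.h and is more careful): the derived class's inline buffer is taken to start immediately after the base subobject, so the base can recompute that address and compare it with Begin without ever knowing N.

```cpp
#include <cassert>

// Base class with no knowledge of the small size N.
class SmallVectorBase {
protected:
  void *Begin = nullptr;
  // Address just past the base subobject -- where the sketch assumes
  // the derived class's inline buffer begins.
  const void *getFirstEl() const { return this + 1; }
  bool isSmall() const { return Begin == getFirstEl(); }
};

// Usage sketch: a derived class holding the inline buffer.
struct SV : SmallVectorBase {
  int Buf[4];
  SV() { Begin = Buf; }
  bool small() const { return isSmall(); }
  void growToHeap() { Begin = new int[8]; } // sketch leaks; real code frees in its dtor
};
```

So push_back can decide whether to free the old buffer purely from the pointer value: if Begin still points at the in-object address, nothing was heap-allocated.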
@max0x7ba 8 years ago
Bug in slide 21: `(Size >= (End - CurPtr))` should be `(Size > (End - CurPtr))`.
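For reference, the boundary condition under discussion looks like this in a stripped-down bump-pointer sketch (a hypothetical `Bump`, not the real BumpPtrAllocator): with the half-open range [CurPtr, End), a request of exactly End - CurPtr bytes still fits, so the overflow test wants `>`, matching the suggested fix.

```cpp
#include <cassert>
#include <cstddef>

struct Bump {
  char *CurPtr, *End;  // half-open range of free bytes in the slab
  void *allocate(std::size_t Size) {
    if (Size > std::size_t(End - CurPtr))
      return nullptr;  // real code would start a new slab here
    void *R = CurPtr;  // bump: hand out the front of the free range
    CurPtr += Size;
    return R;
  }
};
```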
@stdafx 5 years ago
In the original source code it looks a bit different: "if (Adjustment + SizeToAllocate
@MrAbrazildo 8 years ago
I don't get the point of the SmallVector class having a fixed size, when the goal of containers is to change it. I can only guess that 'N' is the maximum size the "simulated container" will ever get, right? But how would you know that?
@ChandlerCarruth 8 years ago
It's not a fixed size -- it's an initial *allocation* size. The container still is dynamically sized like std::vector.
@MrAbrazildo 8 years ago
+Chandler Carruth I saw the push_back function; however, it only adds one element. And even if it creates another buffer, how would Begin and End work with two or more arrays, without extra code to handle the situation? I guess you are talking about something low-level that would allow you to delete a compile-time array and reallocate it, unlike what C/C++ allows, right?
@ChandlerCarruth 8 years ago
No, this is totally built with what C++ already allows you to do. The begin and end pointers start off pointing at the inline small buffer, and if extra space is needed it gets allocated and the begin and end pointers are moved to point at the new buffer. You can find all the code for the real implementation here: github.com/llvm-project/llvm-project/blob/master/llvm/include/llvm/ADT/SmallVector.h
@MrAbrazildo 8 years ago
Chandler Carruth So the penalty is the risk of having an unused inline array after the storage grows. Good idea, though. The pointers can also return to the array if memory is freed. How much faster is an array than std::list? For GCC, some years ago, I measured that an array simulating std::map is 6-7 times faster.
@von_nobody 7 years ago
BTW, how about intrusive lists? Or when you say "list", do you mean a `List` of pointers rather than of values? www.codeofhonor.com/blog/avoiding-game-crashes-related-to-linked-lists These have very good properties if lists are very dynamic and objects can move between lists. Of course, there is one big drawback: an object can belong to only one list at a time.
@MrAbrazildo 8 years ago
> 5:00, in the derived class, you could have used the base constructor instead (C++11 and above): _using SmallVectorImpl::SmallVectorImpl;_. > 8:36, at least for GCC, _std::pair_ halves performance compared to 2 separate values. It was shocking to me (absurd geek moment)! However, I guess that an std::array, with first() and second() functions, fixes this issue. _Edit: nope. This array is as slow as std::pair._
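The inheriting-constructor suggestion can be shown with a toy pair of classes (hypothetical names, not LLVM's actual hierarchy): the derived class re-exposes the base's constructors instead of writing forwarding ones.

```cpp
#include <cassert>

class Impl {
  int Cap;
public:
  explicit Impl(int C) : Cap(C) {}
  int capacity() const { return Cap; }
};

class Small : public Impl {
public:
  using Impl::Impl; // C++11: inherits Impl(int), no forwarding boilerplate
};
```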
@EvanED 8 years ago
Re: pair, what are you comparing? Separate correlated arrays for keys and values (e.g. `KeyType keys[N]; ValueType values[N];`) vs. an array of pairs?
@MrAbrazildo 8 years ago
+EvanED I'm comparing: 1) int m; int n; 2) std::pair p; 3) class Pair { std::array a; public: inline const int &first () const {return a[0];} inline int &first () {return a[0];} The same for second, returning a[1]; }; Case 1 is more than 100% faster than the other ones. I also tried a 4th case: packing a pair inside a size_t variable, which turned out to be more than 100% slower than case 1 too.
@Sopel997 8 years ago
the thing is, an array is homogeneous
@alexeiz 8 years ago
Contrary to what Chandler says, bit packing is slow. You won't find it used very often in modern code. Bit packing makes sense only if you save a substantial amount of data (megabytes).
@MattGodbolt 8 years ago
Do you have any qualification for that? On the machines I use (Haswell server class), bit packing is almost free, taking around 2/3rds of a CPU cycle to shift and mask.
@DeGuerre 8 years ago
Bit packing almost always wins if it means your working set fits in cache, or for anything that is to be transmitted over a network.
@deadalnix 8 years ago
Bit packing: 1/2 cycles.
L1: 3-4 cycles. If your data fits in L1, then bit packing is probably not worth it. L1 is 32k.
L2: 10-12 cycles. If your data fits in L2, it's probably a good idea, but I'm not sure you'll win a lot. L2 is 256k.
If you manipulate anything bigger than this, bit packing is DEFINITELY worth it. Most programs are here.
Caveat: bit packing makes your code bigger, so if you are icache bound - you have a large executable with a fairly flat profile - then you may want to reconsider.
@SerBallister 8 years ago
Depends on your use case: if you are primarily setting/clearing bits, then it will need more instructions compared to, say, using an array of bytes; but if you are testing them, particularly multiple bits at the same time, then you make savings.
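The testing-side saving mentioned here looks like this in a small sketch (hypothetical flag names): several packed flags can be checked with a single mask-and-compare, where a byte-per-flag layout would need one load per flag.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t FlagA = 1u << 0;
constexpr std::uint32_t FlagB = 1u << 3;
constexpr std::uint32_t FlagC = 1u << 7;

// One AND plus one compare, regardless of how many bits the mask names.
inline bool allSet(std::uint32_t Bits, std::uint32_t Mask) {
  return (Bits & Mask) == Mask;
}
inline bool anySet(std::uint32_t Bits, std::uint32_t Mask) {
  return (Bits & Mask) != 0;
}
```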