CppCon 2017: Matt Kulukundis “Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step”

99,829 views

CppCon


Days ago

CppCon.org
-
Presentation Slides, PDFs, Source Code and other presenter materials are available at: github.com/CppCon/CppCon2017
-
Hash tables consume a large volume of both compute resources and memory across Google's production systems. The design for hash tables in C++ traces its origins to the SGI STL implementation from 20 years ago. Over these years, computer architecture and performance have changed dramatically, and we need to evolve this fundamental data structure to follow those changes. This talk describes the process of design and optimization that starts with std::unordered_map and ends with a new design we call "SwissTable", a 2-level N-way associative hash table. Our implementation of this new design gets 2-3x better performance with significant memory reductions (compared to unordered_map) and is being broadly deployed across Google.
-
Matt Kulukundis: Google, Senior Software Engineer
Matt is a senior software engineer on the C++ libraries team at Google. Prior to Google, he worked on compilers, machine learning, and underwater robotics. In his free time, he scuba dives in warm places.
-
Videos Filmed & Edited by Bash Films: www.BashFilms.com

Comments: 84
@user-lt9oc8vf9y · 3 years ago
"turns out we can... and it fits on one slide" *slaps 3 pages of template meta programming on one slide*
@Florian-sh9nf · 6 months ago
This guy kinda feels like Tom Scott but for C++. I love it!
@jg5275 · 3 years ago
came for the hash table, stayed for the jokes.
@JonMasters · 1 year ago
This 👆
@marcusklaas4088 · 5 years ago
Ignore the haters lol. I thought you were fucking hilarious dude. Just a tough crowd
@allanwind295 · 2 years ago
Great talk, and I like the progressive way of introducing the material.
@CppCon · 2 years ago
Thank you!
@IllumTheMessage · 6 years ago
Looking forward to eventually trying this out.
@anon_y_mousse · 2 years ago
This is definitely a good talk. I love the subject matter and am quite surprised by the results. I would've figured probing would be slower no matter what, but using a bitmap jump table is a pretty good idea. One of my many attempts had a binary tree for each chain, which was horrible for locality and memory usage, but still sped up the prior flat array approach, somehow. I guess because I didn't insert or delete often, and truthfully inserts were never slow either way, but it did speed up finds, especially if you iterated the entire collection and found each item recursively. This method looks a lot better though; I'll have to do an implementation. Also, nice job on using powers of two for the table sizes so you can bitmask for the index. I found that trick myself back in high school and was shocked that no one used it; now I'm finally seeing it all over.
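(For anyone new to the trick mentioned above, a minimal sketch, assuming the bucket count is a power of two; the function name is illustrative:)

```cpp
#include <cstddef>
#include <cstdint>

// When capacity is a power of two, hash % capacity can be computed
// with a bitwise AND, avoiding an integer division.
inline std::size_t bucket_index(std::uint64_t hash, std::size_t capacity) {
    // Assumes capacity != 0 and (capacity & (capacity - 1)) == 0.
    return hash & (capacity - 1);
}
```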
@SaHaRaSquad · 7 months ago
What a great talk. Both interesting and entertaining, and I actually learned something. The humor reminds me a bit of Zack Freedman/Voidstar Labs.
@TheEVEInspiration · 6 years ago
@Matt Kulukundis, a few ideas to explore: As two consecutive full groups likely mean the table is near its load limit / needs a resize anyway, why keep looping until finding a matching or empty slot? Instead, detect the issue at insert time and execute some other logic in case of a full group. If it's a table of one or two groups, do as you do now. If there are more groups, see if the just-checked group has an "extension group" assigned to it (one per table). If so, check this group first before trying the next/last group. If it has no extension group, just check the next/last group. The first group to fill up is thus extended by doubling it, which might stave off the need for growing the table as a whole. It could be a small price to pay compared to potentially doubling the size in a growth action. It also keeps the data dense for longer, at least if the hash distributes well enough. I haven't tested this, but it seems feasible. For very large tables one can consider having more spillover groups to compensate for the increased cost of growing. In this scenario I would use some low-order index bits of the group that is full to point towards the one extension group that must match it.
@mattkulukundis6638 · 6 years ago
Thanks for the ideas, I will add them to the queue of experiments :)
@ReiniUrban · 5 years ago
I would certainly also try to special-case small tables, like 32-128 entries, and replace them with a linear scan. This would reveal the hidden constant overhead.
@disk0__ · 6 years ago
"I'll give you for a moment to read it while I drink some of this lovely, lovely, diet Mountain Dew"
@Runoratsu · 6 years ago
Oh, "diet"! I understood "dead", like "gone completely flat". m)
@lucacavalleri8649 · 3 years ago
24:38 "One and a half bits, that is slightly awkward." It is actually log2(3) bits i.e. ~ 1.58496
@bernadettetreual · 4 years ago
Love his style!
@astro511 · 6 years ago
Top-notch technical work and an insightful presentation, thanks! Do you know what effect or benchmark configuration is responsible for the surprising result that Large Payload is faster than Small Payload for flat_hash_set (around minute 35)?
@mattkulukundis6638 · 6 years ago
Honestly not sure. I suspect that the large payload one is a slight outlier and if we rerun those we would see it come up to be about the same as Small Payload.
@TheEVEInspiration · 6 years ago
+Matt Kulukundis, I made a post with a few ideas that might be beneficial and just reply here so I am sure you get a notification. kzbin.info/www/bejne/pJSrnniLoq-NnJY&lc=UgyIBvVL_JNXGPZhY4B4AaABAg I hope it makes sense and does what I expect it to do.
@brenogi · 6 years ago
I wonder how many people are writing the code themselves based on this talk, just for fun :D
@JiveDadson · 6 years ago
I would definitely do that if I could jump to the place where he describes the structure and algorithms. A third of the way in, all I have seen is how to do it badly. Plonk.
@brenogi · 6 years ago
I took some time to implement each of those steps, up to the table with the extension. Fun ride, and it works! :D
@figfox2425 · 5 years ago
I am.
@abhisheksingh-mj6kb · 1 year ago
Going to write in C.
@vipuljain4494 · 3 years ago
insightful
@laurivasama593 · 6 years ago
When removing with arbitrary group positions, even if the group that the element was found in has empty slots, there may exist another group containing the same element and no empty slots. Do you check for empty slots in both the group ending and the group beginning with the to-be-removed element, in order to know whether the slot may again be marked as empty, or must be marked as deleted?
@mattkulukundis6638 · 6 years ago
With arbitrary group positions you have to verify that [pos-15, pos+15] does not contain a run of 16 non-empty items.
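(To make that rule concrete, a hedged sketch of the erase-time check; the control-byte encodings and names are illustrative, not abseil's actual code:)

```cpp
#include <cstddef>
#include <cstdint>

constexpr int kGroupWidth = 16;
constexpr std::int8_t kEmpty   = -128;  // illustrative control-byte encodings
constexpr std::int8_t kDeleted = -2;    // tombstone

// Sketch of the rule described above: after erasing the element at `pos`
// (called while ctrl[pos] still holds its "full" byte), the slot may be
// marked kEmpty only if no 16-wide window covering `pos` is entirely
// non-empty; otherwise some probe may have scanned past this group, and
// a tombstone is required. Assumes `capacity` is a power of two >= 16.
std::int8_t control_after_erase(const std::int8_t* ctrl, std::size_t capacity,
                                std::size_t pos) {
    int run = 0;  // consecutive non-empty slots seen so far
    for (int off = -(kGroupWidth - 1); off <= kGroupWidth - 1; ++off) {
        std::size_t i = (pos + static_cast<std::size_t>(off)) & (capacity - 1);
        // Deleted slots do not stop a probe, so they count as non-empty.
        run = (ctrl[i] != kEmpty) ? run + 1 : 0;
        if (run >= kGroupWidth) return kDeleted;  // a fully occupied window existed
    }
    return kEmpty;  // every window covering pos had an empty slot
}
```

(Note that any run of 16 inside the 31-slot window [pos-15, pos+15] necessarily contains pos itself, which is why checking that window suffices.)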
@m.j.rayburn3 · 10 months ago
I wonder if the elephant in the room[1][2] is that you are ostensibly guaranteed to have two cache misses as a result of having two separate arrays. If this is correct, I further wonder if the reason it's not addressed is that the solution to this problem is what separates the Google version from the Abseil version. The mitigation is to do simultaneous memory fetches[3], and while I'm guessing you could do a better job of that with assembly instructions by way of forcing a greater-than-one cache line prefetch (I'm no expert on assembly), all that's necessary (post the sliding window optimization landing[4][5]) is to store the key in the first probe location in a variable before probing the metadata array[3]. This would create an additional branch in the subsequent code, but that could be eliminated by replacing "Key first_key = m_key_array[index]" with "volatile Key first_key = m_key_array[index]" and then ignoring the variable. [1]: kzbin.info/www/bejne/pJSrnniLoq-NnJY [2]: kzbin.info/www/bejne/pJSrnniLoq-NnJY [3]: larshagencpp.github.io/blog/2016/05/01/a-cache-miss-is-not-a-cache-miss [4]: kzbin.info/www/bejne/pJSrnniLoq-NnJY [5]: kzbin.info/www/bejne/gIuoZJJmptulg8k
@marcel_pi · 6 years ago
Very nice talk! One question though: Is there a particular reason for the slot count of a group being 16? Why not 32 for example?
@mattkulukundis6638 · 6 years ago
We tried 32 and didn't see any performance win from it, so we went with the one that allows for smaller tables.
@cutekitten9856 · 6 years ago
Matt, thanks for the awesome talk! I have a question: can you please clarify what was the "green" map in these stacked graphs and why it was being used over std::unordered_map and dense_hash_map?
@mattkulukundis6638 · 6 years ago
`__gnu_cxx::hash_map`. It was an extension to the library that is similar to std::unordered_map but predates C++11.
@xavierthomas1980 · 6 years ago
Great talk, thanks. When you say "the first 7 bits", does this mean "the 7 least significant bits of the most significant byte" on a little-endian system? Also, I tried to search for this bit-distribution entropy thing but get mostly results for crypto hash functions that discuss the distribution of the hash function outputs over the output space, which sounds a little different to my ears, because we can still be uniformly distributed over the output space without any entropy in the LSB. If you could provide a link regarding this I would appreciate it. And just to confirm my thoughts: would MurmurHash2 or 3 comply with the requirements (I think they do), and would using a 32-bit hash value on a 64-bit system cause an issue (I see no issue with this besides diverging from the standard)? Thanks
@mattkulukundis6638 · 6 years ago
MurmurHash is good. It happens that a 32-bit hash will work if your table has fewer than 2**25 elements, but that is not a guarantee we will provide officially.
@mattkulukundis6638 · 6 years ago
Realizing I didn't answer your question: it is currently the least significant bits, but we don't guarantee anything on that front.
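(For readers wondering what the split looks like, a simplified sketch of the H1/H2 decomposition from the talk; per the reply above, none of this is guaranteed, and the shipped implementation may differ, e.g. by mixing in a per-table salt:)

```cpp
#include <cstdint>

// SwissTable-style split of one hash value (simplified sketch):
// H2 is the 7-bit fragment stored in each control byte,
// H1 selects the starting group in the table.
inline std::uint64_t H1(std::uint64_t hash) { return hash >> 7; }
inline std::int8_t   H2(std::uint64_t hash) { return hash & 0x7F; }
```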
@xavierthomas1980 · 6 years ago
Thank you for the feedback. I think I will set out to code a quick version of this; it all seems quite accessible (leaving out the asm optimization and the smart stuff that figures out the best-performing version of what the user wants).
@immanuelhaffner6915 · 4 years ago
You can use a so-called "bit mixer" to populate all bits with the entropy of only a few bits. MurmurHash's "finalizer" is known to be a good bit mixer. A detailed discussion of Murmur3's finalizer can be found here: zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html. I should add that the article claims the bit mixer adds entropy, but that is not correct.
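(For reference, Murmur3's 64-bit finalizer is tiny; it is a bijection, so as the comment notes it redistributes entropy across the word rather than adding any:)

```cpp
#include <cstdint>

// MurmurHash3's 64-bit finalizer ("fmix64"): a bijective bit mixer.
inline std::uint64_t fmix64(std::uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}
```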
@brenogi · 6 years ago
Have you guys played with prefetching? I honestly never got it to work positively myself, but on the other hand I've never gone down to SSE instructions either. Perhaps you could prefetch the data before doing the lookup on the metadata, so by the time you find a match (if any) the data could already be in place or on its way.
@brenogi · 6 years ago
By the way, I'm implementing a hash table using the ideas from this talk. Apparently the prefetch did give me 15% or so, for the use cases I'm directing my benchmark at.
@brenogi · 6 years ago
On second thought, maybe my implementation is just so crappy that there is time for the data from the prefetch to arrive...
@brenogi · 6 years ago
On third thought, for large tables the metadata will be a cache miss, and prefetching the data before accessing the metadata gives it a good chance to be there if it turns out to be necessary. That's what I observe in my implementation.
@mattkulukundis6638 · 6 years ago
We have played with it but haven't firmly concluded on it yet. Intuitively, issuing a prefetch for the data before accessing the metadata should be a win for find-hit cases.
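(A hedged sketch of the idea under discussion. All names here are illustrative stand-ins, not abseil's API; the scalar probe replaces the talk's SSE version, and `__builtin_prefetch`/`__builtin_ctz` are GCC/Clang intrinsics:)

```cpp
#include <cstddef>
#include <cstdint>

// Scalar stand-in for the talk's SSE group probe: bit i of the result
// is set iff group[i] equals the 7-bit hash fragment h2.
inline int match_in_group(const std::int8_t* group, std::int8_t h2) {
    int mask = 0;
    for (int i = 0; i < 16; ++i)
        if (group[i] == h2) mask |= 1 << i;
    return mask;
}

// Illustrative find: issue the data prefetch *before* probing the
// metadata, so the two potential cache misses overlap. Ignores
// wrap-around and probing beyond the first group, for brevity.
inline const int* find_in_first_group(const std::int8_t* ctrl, const int* slots,
                                      std::size_t capacity, std::uint64_t hash,
                                      int key) {
    std::size_t idx = (hash >> 7) & (capacity - 1);  // starting slot
    __builtin_prefetch(&slots[idx]);                 // start the data fetch early
    int mask = match_in_group(&ctrl[idx], static_cast<std::int8_t>(hash & 0x7F));
    while (mask != 0) {
        int bit = __builtin_ctz(static_cast<unsigned>(mask));
        if (slots[idx + bit] == key) return &slots[idx + bit];
        mask &= mask - 1;  // clear the lowest candidate bit
    }
    return nullptr;  // real code would continue to the next group
}
```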
@saurabhnsit2001 · 4 years ago
Very insightful talk. Is flat_hash_set open source now? If yes, can you please provide a pointer to find it online?
@matthewfowleskulukundis3714 · 3 years ago
github.com/abseil/abseil-cpp/blob/master/absl/container/flat_hash_set.h
@frostiebek · 6 years ago
Really inspiring presentation. Any reason why you chose 128-bit SSE over 256-bit AVX2, or even AVX-512? Did you get the chance to test those?
@mattkulukundis6638 · 6 years ago
We have done tests with 256-bit versions and did not find a win over 128. We have not tried 512.
@ulfjohansen2139 · 4 years ago
I don't get the example at 38:35. He says it is going to pick the r-value insert, but that's wrong: it is going to pick the templated forwarding-reference one, which will forward the pair as a reference, so I don't get how he can measure any difference. Maybe I misunderstood something, but I even tried the code on VS2019 and there was no additional copy.
@pruibiebehastoet1914 · 2 years ago
It's going to pick the templated r-value-ref overload, which is a better match for non-const types:
... insert( const value_type& value ); // not this one
template ... insert( P&& value ); // but this one
@user-hj4rb1cp3t · 1 year ago
@@pruibiebehastoet1914 Hi, I understand it is picking the template insert overload. But why would there be a copy? I think the template is instantiated as `pair insert(P& value)`, so it is still a ref, and then forwarded into string's constructor. I can't figure out why there's a copy 😨. Would you please explain a little? Thanks
@user-hj4rb1cp3t · 1 year ago
Okay, I think I figured it out. It is because inside the template insert, gcc's implementation (v12.1) will construct a temp key. clang doesn't seem to have this problem.
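(A small self-contained demo of the overload resolution discussed in this thread; `Noisy` is a hypothetical copy-counting type. This typically prints "copy" twice; per the comment above, libstdc++ 12.1 may materialize an extra temporary key on top of that:)

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>

// A string wrapper that reports copies, to observe what insert() does.
struct Noisy {
    std::string s;
    Noisy(const char* p) : s(p) {}
    Noisy(const Noisy& o) : s(o.s) { std::puts("copy"); }
    Noisy(Noisy&& o) noexcept : s(std::move(o.s)) { std::puts("move"); }
    bool operator==(const Noisy& o) const { return s == o.s; }
};
struct NoisyHash {
    std::size_t operator()(const Noisy& n) const {
        return std::hash<std::string>{}(n.s);
    }
};

int main() {
    std::unordered_map<Noisy, Noisy, NoisyHash> m;
    std::pair<Noisy, Noisy> p{"key", "value"};
    // value_type is pair<const Noisy, Noisy>, so `p` is NOT a value_type:
    // overload resolution picks the template<class P> insert(P&&) overload
    // (an exact match), which must then construct a value_type from p,
    // copying both members, even though p was passed by reference.
    m.insert(p);
}
```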
@llothar68 · 6 years ago
Waiting for it. Please make it open source.
@mattkulukundis6638 · 6 years ago
I am working on it. I cannot make any hard commitments or guarantees, but it is high on my priority list.
@PeterGoodman · 6 years ago
Any movement on this front? Would love it as a new year's gift!
@mattkulukundis6638 · 6 years ago
I would too. There is some internal movement (although clearly I would have lost any bet I made). I can't make any commitments, but we are working on it.
@goauld88 · 5 years ago
It's open source now, on GitHub under abseil.
@brenogi · 6 years ago
Is there any concern about the size overhead of the object itself? Are you guys optimizing for that as well? I sometimes face situations where there are millions of small hash map instantiations (and some of those are large, so I need them to be succinct). And some compact hash map implementations I found go up to 88 bytes or more for the object itself. That kills the gains from the implementation being compact.
@mattkulukundis6638 · 6 years ago
sizeof(flat_hash_map) is about 40 bytes. We would like it to be smaller, but that has been a lower-priority optimization. We might get it down to 32, but we need to ensure no performance regressions doing it. A zero-element flat_hash_map allocates no memory at all.
@brenogi · 6 years ago
Sweet! I can't wait :) 40 bytes is already far better than the 56 from STL (godbolt.org/g/Uiiau8).
@bikereview3117 · 6 years ago
40/56 bytes only on the stack, not on the heap ;)
@brenogi · 6 years ago
Ivan Ivanov yes I know. Those "stack" 40/56 bytes make a difference in applications where millions of hashmaps are allocated. I say "stack" because if you allocate the hash map on the heap, then it's on the heap :)
@greg7mdp2 · 6 years ago
Check out my small hash optimization: github.com/greg7mdp/sho
@jacobschmidt · 3 years ago
Did y'all ever consider doing h & (b-1) instead of h % b for the starting bucket? It just seems that you need something to truncate h into the range [0, b).
@jacobschmidt · 3 years ago
Never mind! Looking at 51:30, it seems like y'all already do.
@greg7mdp2 · 6 years ago
release date as open source?
@mattkulukundis6638 · 6 years ago
We are not committing to anything, but we are working on it. I know it is frustrating to wait. We are balancing the value to the community of releasing it as part of Abseil and the value we get from doing internal migrations of our code base.
@atomicgray · 1 month ago
Is the new hash table faster than vectors for iterations?
@hampus23 · 1 month ago
Depends on what you are trying to do
@TimAnderson-ye4rq · 3 months ago
11:45 why not just put it at the second slot of the linked list?
@Bittboy · 6 years ago
So like uhh, if this Matt dude isn't from Southern California, that'd be like...shocking? I never knew the "surfer dude" went on to work at Google :P -- But really though, I found this talk fascinating! Great job on the presentation and keeping the mood real light!
@mattkulukundis6638 · 6 years ago
Actually I am from Minnesota. Although I attended college on the east coast.
@goauld88 · 5 years ago
Finally open source! abseil.io/blog/20180927-swisstables
@projectshave · 5 years ago
Excellent talk! I have the same sense of humor, so I thought it was funny too.
@marinversic9469 · 4 years ago
I must say I don't understand bashing on the jokes in the comments. I found most of the jokes to be quite amusing
@TheEVEInspiration · 6 years ago
Nice work, I will say, but boy is C++ in a messy state when it's even hell for library makers. I know they like hell, as it allows them to show off their cryptic ways, but it's still a big loss when it comes down to it. I used to love C++, but only use parts of it due to the clutter. With all the new layering of crap on top of crap, it looks less and less appealing to ever go back to. It's a reasonable question to raise what knowledgeable developer will seriously consider it now as a productivity language. And without actual users, there is no need for library makers and the language slowly dies out. Reminds me of games where devs focus on the wrong things; it never ends well. Still, it's good to see a focus on efficiency and daring to break baseless standard requirements that are of use to no one. We need much more of that in this field. Simplify our languages and other interfaces and work more declaratively instead of coding everything out all the time. Current languages are all wrong in these ways, as the true problems people need to solve are very similar and could be described compactly in little code. The focus must shift from making tools that satisfy tool makers to making tools that satisfy end-user requirements. This hash is a pretty good example of better priorities. I wish this were done at the level of languages and core products, such as databases, too. In fact these should be integrated much more, along with data visualization tools.
@anon_y_mousse · 2 years ago
I don't know if you'll see this 4+ years later, but 100% agreed. Solve new problems, don't rewrite old solutions.
@johnjonjhonjonathanjohnson3559 · 6 years ago
GOML ass-embly optimizer
@KabelkowyJoe · 1 year ago
In the first 3 minutes of unfunny jokes he almost lost me; I skipped to 23:00 (skipping unordered map) and will probably watch the rest. EDIT 25:00, good point ("we want to use all bits well"), but another joke... OMG how old is he? 8? The point was fewer collisions on average, adjusted for the data set's distribution. At this moment I came up with a "stupid idea": a translation table for the data (data -> table[data] -> result data for the hash calculation), statistically adjusted for your data distribution (in a first testing pass) to use more bits for the numbers/letters seen most often and fewer bits for ranges seen less frequently, focusing more bits on local context (local order). I noticed that some hashes use sum and xor in such a way: the sum is the broad context of the data, the xor is the local context, so no data is lost; shifts just position the data and cut off useless bits, and % prime numbers do the same with the bits of the sums. Am I correct? I assume all "research into the perfect hash" is just an attempt to adjust the hash to the most common data distribution so as to have fewer collisions. Since, for example, ASCII uses 5-6 bits at most, shifting the data 2-3 bits usually helps. Probably what I just wrote is stupid... Question: watch more or not?