CppCon 2016: Fedor Pikus “The speed of concurrency (is lock-free faster?)”

36,768 views

CppCon

7 years ago

CppCon.org
-
Presentation Slides, PDFs, Source Code and other presenter materials are available at: github.com/cppcon/cppcon2016
-
This talk takes the “ultimately practical” approach to concurrent programming, with a focus on lock-free programs: after all, in reality such programs are almost always written in the hopes of getting better performance. We’re going to measure performance of the individual concurrent primitives and their effect on the overall performance of the whole program.
The goal of the talk is two-fold. On one hand, I will show a set of tools and practices that can be used to get quantitative measurements of the performance of different implementations under various load conditions. Mastering these techniques will allow the attendees to choose their concurrent algorithms and implementations based on solid data instead of guesswork or “common knowledge” (which is often wrong or outdated). On the other hand, even with the focus on real-life applications we can learn a few things about the fundamental nature of concurrent programs. This understanding comes in especially useful when dealing with “common knowledge” and “simple logic”. For example, it’s “common knowledge” that lock-free programs are faster than lock-based ones (not always). It’s also “simple logic” that the hardware must protect shared memory access in a multi-core system, so ultimately locking is always present (sometimes true, sometimes true but misleading, and sometimes false). It is both “common knowledge” and “simple logic” that a wait-free program does not wait (but if your definition of wait is “will I have to wait for the results after I finish my coffee?” then it definitely does).
We will explore practical examples of (mostly) lock-free data structures, with actual implementations and performance measurements. Even if the specific limitations and simplifying assumptions used in this talk do not apply to your problem, the key point to take away is how to find such assumptions and take advantage of them in your specific application: after all, in reality it’s almost always about performance.
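To make the “measure, don’t guess” point concrete, here is a minimal benchmark sketch of the kind of comparison the talk is about: the same shared counter updated under a std::mutex and then with a std::atomic. This is not the talk’s actual harness; the thread and iteration counts are illustrative, and a real measurement would add warm-up, repeated runs, and thread pinning.

// Minimal benchmark sketch (illustrative, not the talk's harness).
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 4;            // illustrative parameters
    constexpr long kItersPerThread = 1'000'000;

    // Run `body` kItersPerThread times on each of kThreads threads, return seconds.
    auto run = [&](auto&& body) {
        std::vector<std::thread> threads;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kThreads; ++i)
            threads.emplace_back([&] {
                for (long j = 0; j < kItersPerThread; ++j) body();
            });
        for (auto& t : threads) t.join();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    };

    long locked_count = 0;
    std::mutex m;
    double t_lock = run([&] { std::lock_guard<std::mutex> g(m); ++locked_count; });

    std::atomic<long> atomic_count{0};
    double t_atomic = run([&] { atomic_count.fetch_add(1, std::memory_order_relaxed); });

    std::printf("mutex:  %.3f s (count=%ld)\n", t_lock, locked_count);
    std::printf("atomic: %.3f s (count=%ld)\n", t_atomic, atomic_count.load());
}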
-
Fedor Pikus
Mentor Graphics
Chief Engineering Scientist
Portland, Oregon Area
Fedor G Pikus is a Chief Engineering Scientist in the Design to Silicon division of Mentor Graphics Corp. His earlier positions included Senior Software Engineer at Google and Chief Software Architect for Calibre PERC, LVS, DFM at Mentor Graphics. He joined Mentor Graphics in 1998 when he switched from academic research in computational physics to the software industry. His responsibilities as a Chief Scientist include planning the long-term technical direction of Calibre products, directing and training the engineers who work on these products, design and architecture of the software, and research in new design and software technologies. Fedor has over 25 patents and over 90 papers and conference presentations on physics, EDA, software design, and the C++ language.
-
Videos Filmed & Edited by Bash Films: www.BashFilms.com

Comments: 9
@RealVercas 7 years ago
I really hoped you'd test ticket locks as well. These are much more efficient than binary spinlocks in kernels (where there is no interruption).
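For reference, a ticket lock in standard C++ looks roughly like the sketch below; this is a minimal illustration, not something benchmarked in the talk. Tickets are handed out with fetch_add and served in FIFO order, which is what gives the fairness advantage over a plain test-and-set spinlock.

// Ticket lock sketch (illustrative).
#include <atomic>

class TicketLock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};
public:
    void lock() {
        // Take a ticket, then wait until "now serving" reaches it (FIFO order).
        unsigned my = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != my)
            ; // spin; a real implementation would pause or back off here
    }
    void unlock() {
        // Only the owner writes now_serving, so a plain load+store is enough.
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};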
@olaliljedahl9214 5 years ago
Write contention is the primary scalability limit for write-invalidate cache coherency protocols. Each thread that attempts to update a cache line must fetch the line (in exclusive or modified state) and invalidate all other copies. Lock-based and lock-free designs that have the same memory write behavior (from the perspective of the CC protocol) will also scale and perform similarly. There are pitfalls to watch out for though, e.g. CAS may request the line in exclusive state even if the comparison fails and the update is not performed ("false write contention"?). Similarly, certain spin lock implementations (using exchange instead of compare_exchange) can invalidate the line with the lock in the lock owner's cache even if the lock is already busy. CPU microarchitecture, interconnect (inter-CPU) latencies and thread count/contention level also influence the actual impact on performance and scalability.
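The spin-lock pitfall mentioned above (an unconditional exchange invalidating the owner's cache line even while the lock is busy) is usually avoided with the test-and-test-and-set pattern; a rough sketch, not the implementation measured in the talk:

// Test-and-test-and-set spinlock sketch (illustrative).
#include <atomic>

class TTASSpinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            // Read-only spin: keeps the line in shared state while the lock is busy,
            // so waiters do not keep invalidating the owner's copy.
            while (locked.load(std::memory_order_relaxed))
                ; // optionally pause/yield here
            // Only attempt the exchange (which requests the line exclusively)
            // once the lock looks free.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};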
@MohammadHusain 7 years ago
Great talk!
@user-yc5fq9bv3u 2 years ago
18:11 I do not believe that spinlocks are power-efficient at all. If an atomic access takes more CPU time but does not waste power, it is probably the more efficient option even though it is slower.
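For what it's worth, spin loops usually include a CPU hint to reduce the power cost of busy-waiting; a rough sketch (the names here are illustrative, not from the talk):

// Spin-wait with a CPU pause hint (illustrative).
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
#include <immintrin.h>
#endif

inline void spin_pause() {
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__)
    _mm_pause();                 // tells the core it is spinning: less power, friendlier to SMT siblings
#else
    std::this_thread::yield();   // portable fallback: give the scheduler a chance
#endif
}

inline void spin_until_set(const std::atomic<bool>& flag) {
    while (!flag.load(std::memory_order_acquire))
        spin_pause();
}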
@hanyanglee9018 2 years ago
I'm curious: if a singly linked list or queue is used only for task dispatch, worker threads would each just grab their one node and leave. They only have to modify the head, and then they consume their own nodes. That approach is much simpler.
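A minimal sketch of the scheme described in the comment above, assuming the list is filled before the workers start and nodes are only consumed (so ABA is not an issue here); all names are illustrative:

// Single-producer setup, multi-consumer pop-only task list (illustrative).
#include <atomic>

struct Task {
    Task* next = nullptr;
    // ... payload ...
};

class TaskList {
    std::atomic<Task*> head{nullptr};
public:
    // Called before the workers start (single-threaded setup).
    void push(Task* t) {
        t->next = head.load(std::memory_order_relaxed);
        head.store(t, std::memory_order_relaxed);
    }
    // Each worker pops one task by CAS-ing the head; returns nullptr when empty.
    Task* pop() {
        Task* t = head.load(std::memory_order_acquire);
        while (t && !head.compare_exchange_weak(t, t->next,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed))
            ; // on failure, t is reloaded with the current head
        return t;
    }
};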
@MatthijsvanDuin 5 years ago
On plenty of architectures (e.g. ARM), incrementing a std::atomic is not wait-free.
@olaliljedahl9214 5 years ago
ARMv8.1-A introduces atomic instructions such as LDADD and STADD. There could still be delays due to contention, and I don't know to what extent fairness is guaranteed.
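To illustrate the point: the same fetch_add below compiles to an LDXR/STXR retry loop on older ARM cores (lock-free, but not wait-free), and to a single far atomic instruction when LSE atomics are available (e.g. compiling with -march=armv8.1-a). The sketch only shows the C++ side; the instruction selection is up to the compiler and target.

// fetch_add: retry loop or single LDADD/STADD depending on the target (illustrative).
#include <atomic>

std::atomic<long> counter{0};

void hit() {
    // Old value not needed; with LSE this can become STADD.
    counter.fetch_add(1, std::memory_order_relaxed);
}

long hit_and_get() {
    // Old value needed; with LSE this can become LDADD.
    return counter.fetch_add(1, std::memory_order_relaxed);
}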
@XiongZou 4 years ago
Does anybody know the URL of Dan's talk that Fedor mentioned at the beginning of his talk?
@somehrjdbddhsjsbnsndndk751 2 years ago
Did you find it? Thank you