Eye-opening talk. That trick with replacing if's with templates was simple and neat.
@MrSam1ization 4 years ago
we have if constexpr now which is basically the same thing
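A minimal sketch of the idea in these two comments, assuming C++17. The names `Side`, `limit_runtime` and `limit_static` are hypothetical and do not come from the talk; the point is only that a decision known at compile time can be moved out of the hot path, either with a template parameter or with `if constexpr`.

```cpp
#include <cassert>
#include <cstdint>

enum class Side { Buy, Sell };

// Runtime version: the branch on `side` is evaluated on every call.
inline std::int64_t limit_runtime(Side side, std::int64_t price, std::int64_t offset) {
    assert(offset >= 0);
    return side == Side::Buy ? price - offset : price + offset;
}

// Compile-time version: `S` is a template parameter, so each instantiation
// keeps only one arm of the `if constexpr` and no branch on the side remains.
template <Side S>
std::int64_t limit_static(std::int64_t price, std::int64_t offset) {
    assert(offset >= 0);
    if constexpr (S == Side::Buy) {
        return price - offset;
    } else {
        return price + offset;
    }
}
```

A call such as `limit_static<Side::Buy>(100, 2)` picks the arm at compile time. Whether that is measurably faster than the runtime branch depends on the compiler and the surrounding code, so it is worth measuring rather than assuming.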
@justinkendall8341 5 years ago
I've learned an immense amount from CppCon's uploads, this video included. Great talk, Mr. Cook.
@gnir6518 2 years ago
Says in the description he holds a PhD, so Dr. Cook 😄
@tissuepaper9962 3 years ago
"I did a runthrough this morning, threw away about half the slides." I felt that one. There's always too much to say and not enough time to say it.
@narnbrez 4 years ago
One of those jobs where finding a trick like this makes the company your salary back in earnings
@Moriadin 3 years ago
Been looking for a talk like this for 10 years. Thanks!
@CppCon 3 years ago
Glad it was helpful!
@grostig 5 years ago
Great talk, easy enough to understand. Thank you for taking the time.
@betajester5729 7 years ago
This was fascinating, gave me a lot to think about.
@HashanGayasri 5 years ago
As a person from the other end (Exchange), I find the differences quite interesting. We target an end-to-end latency of around 30us with very low jitter, since we also care a little about throughput and use a fault-tolerant distributed system. We often make performance sacrifices to keep the code base easy to understand for most people.
@LuizCarlosSilveira 5 years ago
Do you use Linux/C++ there, or what else? Also, if you don't mind telling what is the exchange in question I'd appreciate :)
@HashanGayasri 5 years ago
@@LuizCarlosSilveira Yes, that's generally the combination used. It's Millennium Exchange - used by the London Stock Exchange and others.
@VeiusIuvenis 2 years ago
An Exchange does not need to be too fast. When it slows, it slows equally.
@cavingada 2 years ago
currently working as an intern at an exchange and I'd agree, the differences are really interesting. Especially when looking at it from the high performance point of view
@sirantoweif5372 5 years ago
The entire presentation is about why you need to learn assembly, machine language and your hardware inside out when going for the fastest execution possible. Mid and high-level languages don't have any of the tools you need. You need to create them specifically for each platform if you want to exploit every single cycle to your advantage. If you change something / upgrade the hardware, then you're 100% going to need to rewrite part of the code or all of it to make it optimized again. Mid-level languages are still insanely fast though, no question about that. The question is rather whether you can afford to place 2nd when you aimed to be 1st.
@eto895 6 years ago
This is great .. love C++
@ciCCapROSTi 2 years ago
This is a great talk. Very different from the others, but still interesting, easily approachable, valuable info.
@CppCon 2 years ago
Glad you enjoyed it!
@edubmf 2 years ago
I wonder if the original presentation is available? it's not as good as having the presenter talk through the "missing" slides however it's better than nothing.
@memoriasIT 5 years ago
This is a great talk, good content and well presented
@xealit 1 year ago
Excellent practical talk
@Hector-bj3ls 8 months ago
Not sure you're supposed to mention the star gate. Mr SG14. Not SG1, I get it. You just want some recognition for your hard work.
@zhengchunliu5839 5 years ago
Thanks for the eye-opening talk. If you turn off all other CPU cores, where does your OS kernel run? Would there be interference from the OS if your thread and the OS share the single core?
@sn3k 4 years ago
That's why it's important not to generate kernel calls - in that case your kernel doesn't have to do anything.
@kenjichanhkg 3 years ago
really want to know more about the measurement setup, how do you actually tap the line, how to actually do the configurations?
@GustavoPantuza 7 years ago
Very good talk. Thanks for sharing, Carl.
@oblivion_2852 3 years ago
How does the flat unordered map not have cache misses? I thought it'd be a cache miss whenever the data has to be dereferenced from a pointer? Is the idea to allocate the data such that the data follows the map in memory so that the l2 and l3 caches hold it?
@serg_joker 3 years ago
According to my understanding, it saves cache misses on traversing nodes in case of collisions. Still, it will have a cache miss on accessing `value` after we have found the correct node. With a list-based node store, every jump to the next node is a cache miss. For example, if we have 3 items in the bucket and you look up the third, with the linked list-based approach you're going to have two cache misses (because of traversing nodes in a bucket) and with the flat map you'll have just one (when dereferencing data).
@Mallchad 4 months ago
Late, but: speculative hardware pre-fetching. Guess the presumed destination for a function and pre-fetch the area before the instruction is even executed so it arrives early. Also, if your cache coherency is good you can keep a very large portion of the map in _at least_ L3 cache.
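To make the flat-versus-chained point in this thread concrete, here is a rough sketch of an open-addressing table with linear probing. It is not the container from the talk: `FlatMap`, its fixed capacity, the 64-bit keys and the inline values are all assumptions made for illustration, and erase and resize are left out. Colliding keys land in neighbouring slots of one contiguous array, so a probe run walks adjacent memory instead of chasing list pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical fixed-capacity flat map (no erase, no resize).
class FlatMap {
public:
    explicit FlatMap(std::size_t capacity_pow2)
        : slots_(capacity_pow2), mask_(capacity_pow2 - 1) {
        // Capacity must be a non-zero power of two so `& mask_` wraps correctly.
        assert(capacity_pow2 != 0 && (capacity_pow2 & mask_) == 0);
    }

    // Inserts or overwrites; returns false when the table is full.
    bool insert(std::uint64_t key, std::int64_t value) {
        std::size_t i = std::hash<std::uint64_t>{}(key) & mask_;
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            Slot& s = slots_[i];
            if (!s.used || s.key == key) {
                s = Slot{key, value, true};
                return true;
            }
            i = (i + 1) & mask_;  // next slot, adjacent in memory
        }
        return false;
    }

    std::optional<std::int64_t> find(std::uint64_t key) const {
        std::size_t i = std::hash<std::uint64_t>{}(key) & mask_;
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            const Slot& s = slots_[i];
            if (!s.used) return std::nullopt;  // an empty slot ends the probe run
            if (s.key == key) return s.value;
            i = (i + 1) & mask_;
        }
        return std::nullopt;
    }

private:
    struct Slot {
        std::uint64_t key = 0;
        std::int64_t value = 0;
        bool used = false;
    };
    std::vector<Slot> slots_;
    std::size_t mask_;
};
```

Because this sketch stores the value inline, finding the slot also finds the value. A variant that stores a pointer to the value would still pay the one extra miss on dereference that the reply above describes.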
@Carutsu 2 months ago
58:40 I do not think I follow this, what is the condition he's pointing to that would block all mutexes?
@LordNezghul 7 years ago
Why fight with your operating system when your program can be operating system itself thanks to solutions like includeOS?
@carlcook8194 7 years ago
Fighting the operating system isn't too hard to be honest. It's the hardware that causes me the most problems. IncludeOS is cool, but it has a long way to go before the tooling for it is good enough to run commercial HFT software (think about debuggers, profilers/instrumentation, monitoring support, etc). I do see potential in includeOS, but won't be an early adopter of it!
@msnzuigt 4 years ago
Isn't includeOS DOS with extra steps?
@AbhishekTiwari-hc5ph 3 years ago
Could anyone reproduce the problem shown at 44:47 ? I couldn’t with older GCC compilers..
@butterfury 7 years ago
Very cool talk
@tonvandenheuvel 7 years ago
Hey Carl, great talk. During the talk you mention having to delete a whole lot of slides; I wonder whether it is possible for you to share those slides as well? I think folks will be interested in those as well.
@carlcook8194 7 years ago
Hi Ton, thanks, excellent point. I'm working on another, very similar talk now (to be presented at pacificplusplus.com); once those slides are ready I'll post a link here. It's mainly just material on branch prediction, but will still be useful.
@superblitz2274 2 years ago
@@carlcook8194 link please? It's been 4 years haha
@TheValiantFox 7 years ago
Awesome talk!
@Conenion 7 years ago
Very good talk. Thank you.
@ludwig7686 7 years ago
how do you log the trading systems actions? that should be overhead
@carlcook8194 7 years ago
Indeed there is logging in there. But I will say a couple of things: 1.) We keep logging to an absolute minimum, and 2.) A lot of work goes into keeping the logging code as low-touch as possible, i.e. making sure that the logging is done from a separate thread, not evaluating format strings in the hotpath, and a few other tricks. I actually spoke about logging in this talk a year ago if it helps: kzbin.info/www/bejne/q52yfXqOaK2Beasm14s
@sami-pl 5 years ago
@@carlcook8194 That seems more like tracing pre-determined things, and for it to be fast it had better fit in a cache line with some timestamp, even at the cost of precision, or maybe even a relative one. This store can actually even be marked as non-temporal so that it doesn't eat cache (excluding WC). Then once in a while it may be picked up by another thread or, well, a memory-mapped FIFO. Anyway, it's nice hearing these things from you, Carl; you do have that ability to go through hard stuff without the unnecessary details, which anyone interested will have to dig into anyway. Good job
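One way to read the two logging comments above is sketched below. The hot path copies a small fixed-size record into a single-producer/single-consumer ring, and a separate thread pops records and only then evaluates the format string. `LogRecord`, `LogRing`, `format_record` and the two message IDs are hypothetical names for illustration, not the speaker's implementation.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical record: an ID for a pre-registered message plus raw arguments.
struct LogRecord {
    std::uint32_t msg_id;
    std::int64_t arg0;
    std::int64_t arg1;
};

// Single-producer/single-consumer ring buffer; N must be a power of two.
template <std::size_t N>
class LogRing {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    // Hot path: copy the raw record; no formatting, no allocation, no locks.
    bool try_push(const LogRecord& r) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;  // full: drop rather than block
        buf_[head & (N - 1)] = r;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Logging thread: take one record if one is available.
    bool try_pop(LogRecord& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;  // empty
        out = buf_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    std::array<LogRecord, N> buf_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Cold path: format strings are only evaluated here, off the hot thread.
inline std::string format_record(const LogRecord& r) {
    assert(r.msg_id < 2);  // only two message IDs exist in this sketch
    const long long a0 = static_cast<long long>(r.arg0);
    const long long a1 = static_cast<long long>(r.arg1);
    char line[128];
    if (r.msg_id == 0) {
        std::snprintf(line, sizeof line, "order sent id=%lld px=%lld", a0, a1);
    } else {
        std::snprintf(line, sizeof line, "order rejected id=%lld code=%lld", a0, a1);
    }
    return std::string(line);
}
```

In use, the trading thread would call `try_push` and a background thread would loop on `try_pop` followed by `format_record`. Dropping records when the ring is full is a design choice of this sketch, made so that the hot path never waits on the logger.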
@colinmaharaj 2 years ago
22:45 I want to do my calculations for an entry point in separate threads man. I'm coding pattern detection, if I'm monitoring 20 stocks I need my threads. Thoughts?
@bearwolffish 2 years ago
Top tier talk
@dosomething3 3 years ago
14:38 "HandleError(errorFlags);" should only tell another thread to handle the error. The current thread should only perform the hot-path and none of the actual error handling. This way, the current thread's cache is hot-path-optimized. So we now have a hot-path-thread. Your thoughts?
@scottsanett 7 years ago
He's speaking a rhotic New Zealand accent... Interesting
@carlcook8194 7 years ago
Ha ha, my wife is a linguist and she was impressed by your observation! Yeah, I lived in the southern part of NZ for three years where there is a Scottish population, and then I also lived in The Netherlands which influenced my strange accent.
@rajesh09ful 6 years ago
@@carlcook8194 John Oliver has done his part quite well to edify the global population about the New Zealand accent :)
@UCCi6g4buFK8dVtAxbFYEKzA 2 years ago
Does this form of trading generate any value (for the world) or is it just parasitic?
@haha71687 1 year ago
Market makers provide liquidity.
@hashishishin 7 years ago
That should be an RT OS or a specialized FPGA sitting right there on that network card!
@carlcook8194 7 years ago
Indeed! But often you need to move faster than FPGA development, and/or hit limitations about how many FPGAs you can have per exchange. So far I've been fine not needing real time operating systems, but it's certainly worth thinking about.
@rinkaenbyou4816 5 months ago
2:40 They are selling std::future and std::option. That's good to know.
@spinthma 4 years ago
Really great understandable insights!
@michieljaspers9889 7 years ago
Interesting talk. Do you actually measure ( aka. use performance counters ) cache hits / TLB hits when optimizing code, or do you optimize code based on instinct and experience? Modern cpu's often have HW prefetchers for instructions. Strive for long instruction streams (aka no jumping in Program Counter) to fully utilize this feature. There are many other optimizations to be done based on the microarchitecture of the processor. Do you guys have experts from Intel helping out?
@carlcook8194 7 years ago
Hi, indeed I do take a look at cache and TLB metrics, but really it's all about measuring end-to-end time, first in the lab, and then in production. So if I see that the results in the lab are slow (or even worse, production), one of the first things I'll look at is the output of linux perf. In terms of coding/optimizing based on experience, I have found that for both my code and other people's code, making educated guesses doesn't usually work. It's all about measurement, refinement, and then measuring again. For your next question, I don't talk to Intel engineers directly, although I have done a few training courses which gave me some useful insight (including learning that the micro-opcode inside the Intel chips is proprietary, not documented, and not guaranteed to stay the same over different releases of CPUs). So it's a bit hard to specialize in a single Intel CPU. What I will say is that when we really need ultra low latency, FPGAs and other custom hardware are often most suitable. What helps is that when you are competing against other software-based trading systems, knowing to a reasonable level how to make optimizations is often good enough. You can spend a lot of time with one chipset, bios, CPU, network card, operating system, and get a software setup very, very fast, but that's a lot of effort. In some cases that's fine and justifiable, but in other cases it isn't worth the effort, i.e. it's sometimes better to be the first to find new market opportunities and get something relatively fast up and running in software, and be the first to have a successful trading strategy from this. This is why I think being good at writing fast C++ is always a useful skill to have.
@ajaardvark 7 years ago
good question; but.. a trading system is not just calc heavy; it's also about decisions, and that means branching.
@AdrienMahieux 7 years ago
Hi, I was wondering about the added insights of using vtune instead of perf. Seems there's a lot more pcm to get there, but didn't have time so far. Skylake-SP made such a great arch change, gotta check everything again ! Very nice talk !
@kumaranshuman5138 5 years ago
Hi Carl, Was just trying to see how one can measure the number of collisions happening while accessing STL unordered map?
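One possible approach to this question, offered as a sketch rather than an answer from the speaker: `std::unordered_map` exposes a bucket interface (`bucket_count()` and `bucket_size(n)`), so you can at least count how many elements currently share a bucket. `BucketStats` and `bucket_stats` are hypothetical names. Note that this reports the table's state after the fact, not the number of extra probes made during individual lookups.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

struct BucketStats {
    std::size_t buckets = 0;   // total number of buckets
    std::size_t used = 0;      // buckets holding at least one element
    std::size_t collided = 0;  // elements sharing a bucket with an earlier element
    std::size_t longest = 0;   // size of the fullest bucket
};

// Works for any container with the standard unordered bucket interface.
template <typename Map>
BucketStats bucket_stats(const Map& m) {
    BucketStats s;
    s.buckets = m.bucket_count();
    for (std::size_t b = 0; b < s.buckets; ++b) {
        const std::size_t n = m.bucket_size(b);
        if (n == 0) continue;
        ++s.used;
        s.collided += n - 1;
        if (n > s.longest) s.longest = n;
    }
    // Every element is either the first in its bucket or a collision.
    assert(s.used + s.collided == m.size());
    return s;
}
```

Comparing `collided` against `m.size()` before and after changing the hash function or calling `reserve` gives a rough picture of how well the keys spread.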
@kostyatrushnikov2067 7 years ago
At slide 19, how can we call manager->MainLoop()? Does it mean, that we have a "virtual void MainLoop() = 0; " at IOrderManager? Then we still have virtual funcs+templates instead of just virtual funcs :)
@BJovke 7 years ago
Exactly. And there also might be a memory leak there. Factory() function is implicitly casting OrderManager to its base class, IOrderManager. If there are some member objects of OrderManager which allocated RAM from heap their destructor won't be called. Also I'm not sure that unique_ptr can be assigned a value from unique_ptr with different template.
@rationalcoder 7 years ago
There isn't a memory leak as long as the base defines a virtual destructor and OrderManager's destructor works to begin with. Also, that assignment of unique_ptr's is a valid conversion: en.cppreference.com/w/cpp/memory/unique_ptr/unique_ptr
@BJovke 7 years ago
"There isn't a memory leak as long as the base defines a virtual destructor". I don't see that anywhere in the example. Also, the derived class' destructor needs to call destructor of base class. Yes, you're right, unique_ptr assignment works, but there's an implicit cast from OrderManager to its base class. There are too many "hidden" things here. I know that this is just an example but, in my opinion, an example for a conference like this one must be much better. I also don't like seeing some other stuff, like function definitions with empty brackets - "void SendOrder ()". Is it so hard to put "void" in it? A lot of work has been done on C++ standard to increase strict type enforcement. After all, C and C++ are supposed to be strongly typed languages. C programmers and even standard broke this a lot with huge amount of software exchanging "void *" pointers all around the code. And now C++ programmers are breaking this with a lot of implicit conversions and class to base casts. In this example they could have just used a pointer to sender function and switch it to different one when needed.
@rationalcoder 7 years ago
The source for IOrderManager isn't given in the example, it is simply implied that it has a virtual destructor (if not, you will get a leak like you mention). Also, dtor's of derived classes call the dtor's of their parents implicitly, reversing the order of construction, so this example works as intended assuming he didn't mess up IOrderManager. See this: stackoverflow.com/questions/677620/do-i-need-to-explicitly-call-the-base-virtual-destructor
@BJovke 7 years ago
Yes, you're right, my mistake.
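For readers following this thread without the slide, here is a hedged reconstruction of what is being discussed. The type and function names follow the ones quoted in the comments (`IOrderManager`, `OrderManager`, `Factory`, `MainLoop`, `SendOrder`). The bodies, the `Config` struct, the `Run` helper and the `int` return values are assumptions added so the sketch compiles and can be checked.

```cpp
#include <cassert>
#include <memory>

// Assumed interface; the slide does not show it. The virtual destructor is what
// makes destroying a derived object through unique_ptr<IOrderManager> safe.
struct IOrderManager {
    virtual ~IOrderManager() = default;
    virtual int MainLoop() = 0;  // the one virtual call, made once at start-up
};

struct OrderSenderA { int SendOrder() { return 1; } };
struct OrderSenderB { int SendOrder() { return 2; } };

template <typename Sender>
struct OrderManager : IOrderManager {
    int MainLoop() override {
        // Inside the loop the sender is called directly: non-virtual and inlinable.
        return sender_.SendOrder();
    }
    Sender sender_;
};

struct Config { bool use_a; };

// unique_ptr<Derived> converts implicitly to unique_ptr<Base> on return.
std::unique_ptr<IOrderManager> Factory(const Config& cfg) {
    if (cfg.use_a) return std::make_unique<OrderManager<OrderSenderA>>();
    return std::make_unique<OrderManager<OrderSenderB>>();
}

int Run(const Config& cfg) {
    std::unique_ptr<IOrderManager> manager = Factory(cfg);
    assert(manager != nullptr);
    return manager->MainLoop();  // virtual dispatch happens here, outside the hot path
}
```

This matches the conclusion the thread reaches: the conversion is valid, the derived destructor runs the base destructor implicitly, and the single virtual call to `MainLoop` happens once rather than per order.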
@gracelee7929 2 years ago
How can you create an exception without throwing it?
@vedantsonu1983 7 years ago
Hey Carl. Great Talk ! How do you lock L3 cache for one core ( or a set of cores ) without actually turning off the CPUs ? I tried searching but in vain. Can you provide some pointers for that ? Thanks!
@carlcook8194 7 years ago
Hi, here is a link to a basic paper about Cache Allocation Technology: www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html. Hope that helps!
@christopherjohnston3772 7 years ago
Can be done with CAT, pqos has a decent CLI to get that going.
@ThePizzaEater1000 3 years ago
i don't even do anything in C++ idk how I got here
@Pacificplusplus 7 years ago
Here is another version of this talk: kzbin.info/www/bejne/eKnJhWycnrqmkJY
@anthonyvays5786 6 years ago
great talk
@mworld 2 years ago
Yes, map is slower than unordered_map since map is trying to sort as well. I found this out when dealing with millions of records.
@VeiusIuvenis 2 years ago
It depends.
@donha475 4 years ago
great talk!
@maxmustermann5590 4 months ago
Absolutely great talk, but my guy had a 140 pulse the whole way through lol
@YSoreil 7 years ago
Hey, great talk. I'm curious if there is any viability to using non intel parts for this type of single core workload. In particular I was curious if it is viable to use for instance POWER architecture CPUs here. The very fast 5GHz+ cores which are available there and the very large caches might be interesting. This seems like one of the few still existent markets where these parts might make sense for non legacy work. I originally had a more interesting reply but the KZbin autoplay function deleted it, the bane of good videos one actually wants to watch to the end I guess.
@carlcook8194 7 years ago
Interesting comment! I've never looked at other architectures/CPUs, but I am sure some will probably give a decent advantage, once you get familiar enough with them. For example, ARM gives fewer guarantees about the order of reads/writes from multiple threads, which probably means it is faster in some multi-threaded cases (as just one example).
@OlivierLi 7 years ago
Good job! Could you please provide a link about the warming up of SolarFlare adapters?
@carlcook8194 7 years ago
Sure thing. Check out the open onload API documentation for SolarFlare cards (support.solarflare.com/index.php/component/cognidox/?file=SF-104474-CD-23_Onload_User_Guide.pdf&task=download&format=raw&id=361), in particular this section: 8.15 ONLOAD_MSG_WARM
@petrupetrica9366 4 years ago
41:24 Why would you say something so controversial, yet so brave
@DanielMonteiroNit 7 years ago
Hope the use of float is for brevity :-P
@carlcook8194 7 years ago
It's funny that no one commented on this at the time. Indeed, float would never be used in practice... most likely fixed point integer.
@TeeDawl 7 years ago
Hey Carl, I really do enjoy the fact that you're here in the comments. I also really enjoyed the talk! Greetings!
@DanielMonteiroNit 7 years ago
Been playing with sg14::fixed_point lately and it's very good! But hey, which size of fixed points is usual in fintech? What "distribution of bits"? Is it fixed or varied distributions for different uses? Also, at 13:10, are you initialising the error flag somewhere else? Would it be faster with, say, int64_t errorFlags{NO_ERROR}; ? Thanks again for the talk.
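As a rough illustration of the "fixed point integer" remark in this thread, the sketch below stores a price as a 64-bit integer count of 1/10000 of a currency unit. The `Price` type, the scale of 10000 and the helper names are assumptions for the example. It is not `sg14::fixed_point`, and it does not answer which bit layout is usual in practice.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical price type: a signed 64-bit count of 1/10000 currency units.
struct Price {
    static constexpr std::int64_t kScale = 10000;
    std::int64_t ticks;

    // Builds a non-negative price from a whole part and a fraction in ten-thousandths.
    static Price from_parts(std::int64_t whole, std::int64_t frac) {
        assert(whole >= 0 && frac >= 0 && frac < kScale);
        return Price{whole * kScale + frac};
    }
};

// Addition and comparison are plain integer operations, so they are exact.
inline Price operator+(Price a, Price b) { return Price{a.ticks + b.ticks}; }
inline bool operator==(Price a, Price b) { return a.ticks == b.ticks; }
inline bool operator<(Price a, Price b) { return a.ticks < b.ticks; }
```

With this representation `from_parts(0, 1000) + from_parts(0, 2000)` compares equal to `from_parts(0, 3000)` exactly, which is the property that binary floating point does not guarantee for decimal fractions.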
@lapetiteanessesoap7559 2 years ago
good job
@JiveDadson 7 years ago
Subtitle: How to trick the wrong operating system running on the wrong hardware into usually being good enough.
@carlcook8194 6 years ago
Indeed. And being able to move very quickly to new markets and strategies because you are running fairly standard hardware and operating systems which are relatively easy to program against and install. However, every trading company will no doubt also be running their ultra low latency systems on custom hardware and no real operating system to speak of.
@allopeth 5 years ago
22:10 completely hilarious slide xDDDDD
@anthonyvays5786 6 years ago
might as well use Fortran instead of fighting with the compiler so much
@scramjet4610 3 years ago
Why not use C instead of C++ when speed is so important?
@alexandrosliarokapis227 2 years ago
Because C is not faster than C++ and lacks a lot of abstractions that C++ can get without costs. Think templates vs Macros.
@nishanth6403 2 years ago
C need not be faster than C++. And since comp time programming hit Cpp I've seen it beat C quite a no. of times.
@vladosique528 7 years ago
Dude's ripped.
@JiveDadson 7 years ago
Given the job he admits to having - sneaking a profit by cutting in line (queue) - the rippitude may come in handy.
@GeorgeTsiros 2 years ago
People who write "benchmarks" sometimes baffle me. It's like they don't really care about measuring anything, just about having the CPU pegged at "100%" or something and then spitting out numbers with lots of significant digits.
@LuluTheCorgi 5 years ago
Is there even any infrastructure left to produce single core CPUs?
@blazkowicz666 3 years ago
Why not just use C? Also, I'm curious to know how QUIC and gRPC would fit into your plans to reduce latency
@bvskpatnaik9115 3 years ago
QUIC and gRPC are network protocols used for communication. This means stock exchange must support it and honestly I don't think any exchange would support gRPC since that's not ideal for good latency.
@crystalgames 6 years ago
great talk thanks
@KX36 2 years ago
He says intel wouldn't be interested in making CPUs for this market, but this is literally the market that pays for all the international undersea fiber optic cables just to reduce latency slightly. You pay intel enough, they'll be interested.
@VeiusIuvenis 2 years ago
Probably too much for them...
@magwo 7 years ago
Any thoughts on Rust? It seems to me an ideal language for the requirements - predictable, consistent high performance, and high productivity due to modern, high level features that still don't introduce high latency behaviour (by avoiding memory allocations and memory churn). It's like the "low latency techniques" happen to be the "best practices" of Rust development, whereas the "best practices" of C++ is not the same as the low latency techniques of C++ dev.
@carlcook8194 7 years ago
Yes, this is exactly the point. Rust does look cool, but C++ has so many 1000s of man-hours built into the language, the tools, etc, it's going to take one heck of a language to convince people to give up on C++. The same is true for includeOS. Another great idea, fresh thinking, and should in theory make a lot of the tricks that I need to pull obsolete. But it's still too early to start using this product commercially.
@nishanth6403 2 years ago
Except that best C++ practices are largely variant across different domains, and usually extreme low latency design is considered "best practices" at places where Cpp is still used heavily today.
@thomas_investor580 6 months ago
Rust looks way worse than cpp when you want to go nano …
@trejohnson7677 2 years ago
Zero cost abstractions == perpetual motion machine.
@jonastoth7975 7 years ago
I guess the trading CPUs are all 7700k? Why bother with Xeon there, which is there to have many cores?
Now if this isn’t an expert way of saying “ Functional Programming in C++”
@username4441 7 years ago
ok but which random bot do i play russian roulette with on github and download to use as my first bot?
@carlcook8194 7 years ago
This one: github.com/askmike/gekko
@GokuNaru007 6 months ago
Let Carl Cook
@xl000 5 years ago
4:09 Faster by a picosecond? I.e. faster by 1E-12 second? The granularity of Ethernet packets is larger than that.
@kyonsmith9046 5 years ago
16:39 He still needs one virtual function call to Mainloop.
@VeiusIuvenis 2 years ago
But that only runs once, so it is not in the hotpath.
@ntrandafil 4 years ago
Thank you for the talk! 33:28 😂
@ShakthiMonkey 1 year ago
let him cook
@samuelutomo 6 years ago
35:08 👍
@VeiusIuvenis 2 years ago
Good talk, but a little too brief... leaving me a lot of questions...
@tohopes 7 years ago
32:20 dragons lie here.. lol
@tohopes 7 years ago
tfw store didnt have any white shirts so you had to give the talk wearing a pink one
@tohopes 7 years ago
52:14 tfw the guy who bought the last white shirt you wanted used your profiling server as a build machine
@carlcook8194 7 years ago
Ha ha yes, I asked them to cut this calendar screen-grab from the video, but oh well...
@tohopes 7 years ago
I would be pissed if some tab did that to me during a presentation. You handled it well. Great talk btw, very informative.
@riachomolhado999 6 years ago
C ++ is only for males-alphas system. Do this in Java and you will lose millions of dollars.='D
@xExekut3x 3 years ago
"high frequency trading systems" .. should be illegal.
@llothar68 7 years ago
Totally unethical industry he is working in. This should be forbidden; that's even more important than gun control.
@myguiltybody 7 years ago
As a leftist I don't disagree with you, but he's obviously not the bourgeois capitalist, just a developer, and this talk has some valuable information.
@tagged5life 7 years ago
it's the other way around, trying to provide liquidity causes low latency trading.
@richardhickling1905 7 years ago
Well that's the idea - but statistically it doesn't work that way: HFT is just too fast. The orders they provide - though they dominate numerically - don't provide better prices for the counterparty. Also they don't trade illiquid contracts. Paid market-makers have to be pulled in for that. Not blaming them BTW: they're just using technology effectively to make money. I can't see it as unethical either. But there should be more regulation - contact your democratic representative.
@budabudimir1 6 years ago
@@richardhickling1905 Isn't the entire idea that certain HFTs can make money by just squeezing into the inefficient market? I.e. someone wants to sell a certain instrument for $100, but someone else wants to buy that instrument for $90. The HFT is going to see this inefficiency and offer to buy for $92 and sell for $98, for example, thus saving $2 for the buyer, earning $2 for the seller and earning $6 itself.
@jonwise3419 5 years ago
High-frequency trading is in the high end of the spectrum of the soul-sucking experience (Speculation -> Day Trading -> High Frequency Trading). It probably feels like Satan himself is both sucking your soul and your dick at the same time, or, to use a good old trick of obfuscation through abstract parlance and legalese, it probably feels like adding liquidity to the market.