"How NOT to Measure Latency" by Gil Tene

Views: 103,422

8 years ago

Time is money. Understanding application responsiveness and latency is critical, but a good characterization of bad data is useless. Gil Tene discusses some common pitfalls encountered in measuring latency and response time behavior. He introduces how simple, open source tools can be used to improve and gain higher confidence in both latency measurement and reporting.
Gil Tene
AZUL SYSTEMS
@giltene
Gil Tene is CTO and co-founder of Azul Systems. He has been involved with virtual machine and runtime technologies for the past 25 years. His pet focus areas include system responsiveness and latency behavior. Gil is a frequent speaker at technology conferences worldwide, and an official JavaOne Rock Star. He pioneered the Continuously Concurrent Compacting Collector (C4) that powers Azul's continuously reactive Java platforms. In past lives, he also designed and built operating systems, network switches, firewalls, and laser based mosquito interception systems.

Comments: 16

@pranytt3485 1 year ago

Key takeaways for me:
1. Most tools that capture response times report the 99th-percentile latency over each 30-second window. For example, Prometheus metrics are scraped every minute. But the real thing to look at is the max response time.
2. Gatling fixed the coordinated omission problem. Most other tools, such as JMeter, still have it. So use Gatling for load generation and reporting.
3. I didn't fully understand coordinated omission, but I now know that it is bad and needs to be watched for.
4. When a graph shows a sudden spike, that indicates 'possible' coordinated omission. If a graph grows smoothly, that indicates there is no bad data. There may be exceptions to this rule.
5. There is no point in looking at percentile graphs if you don't have performance goals set for your service. If you are comparing two systems and your target is 20 ms, you could plot graphs and see the maximum throughput each system supports while keeping latency at 20 ms.
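Since coordinated omission comes up in several of the takeaways above, a small sketch may help. It assumes a load generator that intends to send one request every 10 ms but blocks on each response, so a single 1000 ms stall hides the roughly 99 requests that should have been sent in the meantime. The back-filling below is one commonly described correction, not necessarily what any particular tool does; the function names and the sample data are hypothetical.

```python
def correct_for_coordinated_omission(samples_ms, expected_interval_ms):
    """Back-fill the samples a blocked load generator never sent.

    For every recorded latency longer than the intended send interval,
    add the latencies the skipped requests would have seen
    (value - interval, value - 2 * interval, ...).
    """
    corrected = []
    for value in samples_ms:
        corrected.append(value)
        missing = value - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)
            missing -= expected_interval_ms
    return corrected


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# Hypothetical data: 99 fast responses (1 ms) and one 1000 ms stall,
# with an intended interval of 10 ms between requests.
raw = [1] * 99 + [1000]
fixed = correct_for_coordinated_omission(raw, expected_interval_ms=10)

print("raw   p99:", percentile(raw, 99), "max:", max(raw))
print("fixed p99:", percentile(fixed, 99), "max:", max(fixed))
```

In this made-up data set the uncorrected 99th percentile looks excellent while the maximum does not, which is one reason the talk's viewers keep stressing the max.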

@TheSuckerOfTheWorld 8 years ago

10 Minutes in and I already see the very obvious flaw that +Gil Tene pointed out in my day-to-day monitoring. Great talk!

@whitegelfling 8 years ago

Coordinated omission: One issue here is one that is often encountered in business metrics: the bosses want simple, easy, and reliable numbers to look at. To the guy behind the project, it is seen as a system that irons out a rare case, without understanding the maths behind it.

@timothydsears 8 years ago

Terrific talk about load testing and lazy thinking. The early part probably applies to anyone thinking about metrics for a complex system.

@TestAutomationTV 1 year ago

Nice talk, I've read good things about it. Now starting to listen, looking forward to finding some good stuff about performance testing.

@WilsonMar1 8 years ago

[6:52] I don't have the data. A common problem we have is we plot only what is convenient. We only plot what gives us nice colorful charts. We choose the noise to display.

@Turalcar 1 year ago

I'm more used to graphs being split for request kinds. To me the first thing that jumped out was the large difference between 50th and 75th percentile.

@minimaddu 8 years ago

Great talk! I'm curious, we get most of our production response time stats from AWS load balancer logs. Is that an accurate measure of response time?

@ruimeireles1695 3 years ago

Can anyone list all the tool names mentioned in the presentation? I can't find some of them, probably because I'm not spelling the names correctly.

@ericj1380 2 years ago

@12:04, is this because of 5 page loads/40 resources per page increasing the chance of hitting above p99? If that’s the case couldn’t you just adjust each graph to be on a per-resource or per-page basis? Which seems like it would directly reflect the percentile.
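The arithmetic behind this question can be sketched as follows, using the 5 page loads and 40 resources per page quoted in the comment. The sketch assumes every request is independent and has exactly a 1% chance of exceeding the 99th percentile, which is a simplification; the function name is hypothetical.

```python
def chance_of_exceeding(percentile, requests):
    """Probability that at least one of `requests` independent requests
    is slower than the given percentile (e.g. 99 for p99)."""
    p_within = percentile / 100
    return 1 - p_within ** requests


page_loads = 5
resources_per_page = 40
session_requests = page_loads * resources_per_page  # 200 requests

p = chance_of_exceeding(99, session_requests)
print(f"Chance a session sees worse than p99 at least once: {p:.1%}")
# prints: Chance a session sees worse than p99 at least once: 86.6%
```

Under these assumptions, most sessions experience something worse than the 99th percentile, so a per-request percentile says little about what a typical user sees.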

@whitegelfling 8 years ago

OK, I'm only a few minutes in and my brain hurts. I can't believe that people seriously ignore the max in things like this. Scary.

@MikkoRantalainen 4 years ago

I agree. Only the maximum (worst-case latency) and the median latency are worth watching. Everything else is just noise.

@MikkoRantalainen 4 years ago

Note that the "median" is not the target; the difference between the worst-case latency and the median latency is the part of the picture that could get better if you fix the bad stuff. Getting the median latency down often requires LOTS of changes to the system.

@MikkoRantalainen 4 years ago

All well-made latency graphs should have the number of requests per second on the horizontal axis and the maximum response time on the vertical axis. The request rate at which the maximum response time becomes too high is the limit.
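Reading the limit off such a graph can be sketched as below. The measurement pairs are invented purely for illustration and the function name is hypothetical; the 20 ms limit echoes the target mentioned in an earlier comment.

```python
def max_sustainable_rps(measurements, latency_limit_ms):
    """Return the highest request rate reached before the maximum
    response time first exceeds the limit, or None if even the
    lowest measured rate is already too slow.

    `measurements` is an iterable of (requests_per_second, max_latency_ms).
    """
    best = None
    for rps, max_latency_ms in sorted(measurements):
        if max_latency_ms > latency_limit_ms:
            break  # everything beyond this rate is past the limit
        best = rps
    return best


# Hypothetical load-test results: (requests per second, max latency in ms).
hypothetical_runs = [(100, 8), (200, 9), (400, 15), (800, 45), (1600, 900)]

limit = max_sustainable_rps(hypothetical_runs, latency_limit_ms=20)
print("Highest rate within the 20 ms limit:", limit)
```

Stopping at the first breach, rather than taking the highest passing rate overall, avoids crediting a system for a lucky run at a load it cannot reliably sustain.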

@GeorgeTsiros 1 year ago

That is why "how to measure", by itself, is an entire class in (at least) physics courses.