A Better Measure of Customer Service: Response Times and How to Talk About Them
With a platform that handles billions of transactions every hour and hundreds of millions of internet users every day, every solution must be built to scale. – MediaMath.com
Throughput Measures How Many Customers You Serve. Response Time Measures How Well You Serve Them.
We are, rightly, very proud of our ability to handle the high throughputs our business requires. We highlight throughput in our marketing materials, list it front and center in our job postings, and graph it on any number of dashboards in our monitoring system. It is a powerful metric for the growth of MediaMath as a company. As we add more biddable inventory and as our number of clients grows, the number of requests per second (RPS) we handle grows as well. Our ability to scale is a strategic advantage, and we are very cognizant of it. Luckily, it also ties directly to revenue, which makes it easy to prioritize.
While throughput is a measure of how many customers a company can serve, it does not tell you how well you serve them. Response time has been recognized as central to customer happiness for over 40 years. It is not, however, a number that companies emphasize in their marketing materials. Its impact on revenue can be hard to quantify, so the places where response time gets prioritized are largely driven by external requirements (exchange response times, for instance).
Throughput is an easy number to talk about and visualize. 70K RPS or 250 RPM (requests per minute) clearly state the throughput of the system. There is complexity around batching effects and burstiness of requests, but by definition throughput is an aggregate number.
Unlike throughput, response time is not an aggregate. Each response generates a unique measurement, and these collections of measurements quickly become very large. For instance, a 70K RPS system produces over 4 million response time measurements every minute. In order to summarize or visualize response time measures, we must resort to statistical analysis, which introduces subjectivity. Depending on how you aggregate these measures, you will get different impressions of how well you are serving your clients.
Response Time is Not Normally Distributed
When visualizing or aggregating response time measurements, it is important to know that in real-world systems, response time measurements are highly skewed. That is, most systems have consistent response times for the vast majority of measurements and a small set of dramatically slower measurements. This skew makes time-based plots hard to interpret:
Image 1: This is a graph based on dummy data.
The dummy data above represents a well-performing system (i.e., it has fewer and smaller outliers than most systems), yet the outliers still dominate the time-based plot. Many monitoring systems make dubious analytical decisions to compensate for this, most commonly simply removing the outliers from the data set. This is equivalent to ignoring all the customers who have the worst experiences, and then deciding that you are doing a good job serving customers!
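To see why trimming outliers is so misleading, here is a minimal sketch using Python's standard library. The lognormal distribution, its parameters, and the handful of injected slow responses are illustrative assumptions chosen to mimic the shape of skewed response time data, not the actual dummy data behind the plot:

```python
import random
import statistics

random.seed(42)

# Illustrative skewed "response times" (msecs): mostly around 0.11 msecs,
# plus a handful of dramatically slower responses.
samples = [random.lognormvariate(mu=-2.2, sigma=0.3) for _ in range(100_000)]
samples += [random.uniform(2.0, 11.0) for _ in range(10)]  # rare slow outliers

mean_all = statistics.mean(samples)

# "Removing the outliers", as some monitoring systems do: drop the worst 1%.
trimmed = sorted(samples)[: int(len(samples) * 0.99)]
mean_trimmed = statistics.mean(trimmed)

print(f"max with outliers:   {max(samples):.2f} msecs")
print(f"max after trimming:  {max(trimmed):.2f} msecs")
print(f"mean with outliers:  {mean_all:.4f} msecs")
print(f"mean after trimming: {mean_trimmed:.4f} msecs")
```

The trimmed numbers look clean and stable, while the slowest customer interactions have simply vanished from view.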
The skew also makes techniques that expect "normal" distributions inappropriate. For example, below is some summary data from our dummy system:
- Total measurements – 6,284,817
- Mean response time – 0.11 msecs
- Median response time – 0.11 msecs
- Standard Deviation – 0.04 msecs
- Max – 10.94 msecs
If this were a normal distribution, we would expect 99.99% of responses to be faster than 0.259 msecs. In the actual sample, the 99.99th percentile was much worse than that – 1.15 msecs. This dramatic skew makes it problematic to use the mean or median as indicators of how well the system is performing. If median or mean response times were to rise enough to overcome the skew and become noticeable, it would be evidence that the system is seriously broken and has been for a very long time.
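The gap between a normal model and skewed reality is easy to reproduce. The sketch below generates illustrative lognormal data with a heavy tail (again an assumption, not our actual sample) and compares what a normal distribution fitted to the mean and standard deviation predicts for the 99.99th percentile against what the data actually shows:

```python
import random
import statistics

random.seed(7)

# Illustrative skewed data: consistent fast responses plus a heavy tail.
samples = sorted(
    [random.lognormvariate(-2.2, 0.25) for _ in range(100_000)]
    + [random.uniform(1.0, 11.0) for _ in range(100)]
)

mu = statistics.mean(samples)
sigma = statistics.stdev(samples)

# What a normal model with this mean and stdev predicts for the 99.99th
# percentile...
normal_p9999 = statistics.NormalDist(mu, sigma).inv_cdf(0.9999)
# ...versus what the sample actually shows.
actual_p9999 = samples[int(len(samples) * 0.9999)]

print(f"normal-model 99.99th percentile: {normal_p9999:.3f} msecs")
print(f"actual 99.99th percentile:       {actual_p9999:.3f} msecs")
```

The actual 99.99th percentile comes out several times worse than the normal model's prediction, which is exactly the pattern in the summary statistics above.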
When Talking About Response Time, Use Statistics That Are Meaningful to Your Customer
In the face of this skew, how can you present metrics that will be meaningful and relevant to your customer?
In order to answer that, first state the metrics as they relate to how well you are serving your customers. For instance, a median response time of 0.11 msecs means, "The best 50% of your interactions with clients completed in 0.11 msecs or less." Notice this doesn't make any claims about the worst half of your interactions with clients (the ones you probably actually care about). The max can be expressed as "All of your client interactions completed within 10.94 msecs." This actually undersells how well you are doing (as the max is a dramatic outlier).
A histogram captures how a system performs much better than any other visualization tool.
Image 2: Histogram of the same dummy data.
While a histogram is good for visualizing response times, the question remains how to write about and discuss them. In those cases you must still summarize the data, and rather than picking a single point, it is better to describe the full curve of response times.
A balanced way to summarize this example is to combine all of these statements:
- 50% of responses were 0.11 msecs or faster.
- 90% of responses were 0.15 msecs or faster.
- 99% of responses were 0.20 msecs or faster.
- 99.99% of responses were 1.15 msecs or faster.
- The worst customer interaction was 10.94 msecs.
This summary allows everyone in the conversation to understand the range of experiences your customers encounter. It does not mislead anyone about the worst-case performance, while still not overemphasizing it.
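Producing a summary like the one above is straightforward once you have the raw measurements. The sketch below is one possible implementation; the function name, the nearest-rank percentile approximation, and the generated measurements are all hypothetical:

```python
import random


def summarize_response_times(measurements):
    """Summarize response times (in msecs) at customer-meaningful percentiles."""
    data = sorted(measurements)
    n = len(data)

    def pct(p):
        # A simple nearest-rank percentile: close enough for a prose summary.
        return data[min(n - 1, int(n * p / 100))]

    return "\n".join([
        f"50% of responses were {pct(50):.2f} msecs or faster.",
        f"90% of responses were {pct(90):.2f} msecs or faster.",
        f"99% of responses were {pct(99):.2f} msecs or faster.",
        f"99.99% of responses were {pct(99.99):.2f} msecs or faster.",
        f"The worst customer interaction was {data[-1]:.2f} msecs.",
    ])


# Example with made-up skewed measurements:
random.seed(1)
times = [random.lognormvariate(-2.2, 0.3) for _ in range(50_000)]
print(summarize_response_times(times))
```

Because the output walks the full curve, a reader sees the typical case, the tail, and the worst case side by side instead of a single number that hides two of the three.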
I hope these suggestions will help you as you build and measure your systems. Next week, I will be sharing my thoughts (and rants) on statsd, a commonly used monitoring tool.
- Gil Tene’s talk at React 2014 was a major impetus for writing this article.
- He also wrote a couple of invaluable tools for measuring response time, mostly on the JVM but also ported to other languages: HdrHistogram and jHiccup, which I used to generate the histograms and charts for this article.
- If you are interested in low latency systems on the JVM, then the Mechanical Sympathy mailing list is a good starting point. JVM performance techniques often translate to other languages and platforms as well, and the nuances of measuring response time are a common topic of discussion there.
- While I was writing this, a great article came out about benchmarking with the Yahoo Cloud Serving Benchmark (YCSB) tool. It is a great overview of response time testing generally, and of a tricky response time topic known as "coordinated omission".
- customers – As a business, we have clients that are both humans and computers. Also, some of our systems support only internal clients rather than external ones. For the purposes of this article these distinctions largely do not matter. Everyone prefers to work with responsive systems.
- jHiccup – This "well performing system" is actually a sample captured via the jHiccup tool on an OpenStack virtual machine that was not running anything. Put simply, this is the jitter placed on a process by the OS and the JVM in a system that is not doing any work.
- mean 90 – statsd has a metric called mean-90. It is perhaps the silliest response time metric you can use in production. It removes the worst 10% of samples and then reports the mean of the remainder. In combination with the normal statsd flush interval mechanics this metric ensures that you will only ever see the good response times.
- 99% – Deciding how many 9s to report is important. For example, if we use the 1 billion transactions per hour system from the quote at the top, 100K customer interactions per hour can be painfully slow and the system can still meet its 99.99th percentile target.
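The mean-90 note above is easy to demonstrate. This sketch uses made-up numbers (not real statsd output) with 90 fast responses and 10 terrible ones, and shows how dropping the worst 10% before averaging erases the bad news entirely:

```python
import statistics

# Illustrative measurements (msecs): 90 fast responses and 10 terrible ones.
samples = [0.11] * 90 + [10.0] * 10

# mean-90 as described above: drop the worst 10%, then average the rest.
kept = sorted(samples)[: int(len(samples) * 0.9)]
mean_90 = statistics.mean(kept)

print(f"mean-90:   {mean_90:.2f} msecs")  # every slow response was discarded
print(f"true mean: {statistics.mean(samples):.2f} msecs")
print(f"true max:  {max(samples):.2f} msecs")
```

Ten percent of these customers waited roughly 100 times longer than everyone else, and mean-90 reports a system with no problem at all.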