Network and Server Uptime Calculation Considerations for Monitoring Tools
Uptime is one of the most important IT infrastructure operational metrics: it gives an overview of how “stable” or “reliable” your IT infrastructure is, with 99.9999% uptime being a platinum standard.
But how do you calculate uptime?
In an ideal (continuous, non-discrete) world, the calculation of uptime is fairly simple.
Take the number of seconds in the monitoring period and the number of seconds the monitored object was down, then apply a simple formula:
Uptime = 100 - ((Outage Duration / Total Time) * 100)
Example:
Monitoring Period: 1 year = 31,536,000 seconds
Total Object Outage Duration: 300 seconds
Object Uptime: 99.999%
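For reference, here is a minimal Python sketch of this calculation (the function name uptime_percent is just illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds

def uptime_percent(outage_seconds: float, total_seconds: float) -> float:
    """Uptime = 100 - ((Outage Duration / Total Time) * 100)."""
    return 100.0 - (outage_seconds / total_seconds) * 100.0

# The example above: 300 seconds of total outage over one year.
print(f"{uptime_percent(300, SECONDS_PER_YEAR):.5f}%")  # 99.99905%, i.e. ~99.999%
```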
Monitored objects achieving “six nines” uptime should only be “down” for a maximum of 31.5 seconds over 365 days:
Uptime % | Maximum Downtime per Year |
99.99% | ~3,154 sec |
99.999% | ~315 sec |
99.9999% | ~31.5 sec |
99.99999% | ~3.15 sec |
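These budgets follow from the same formula rearranged for downtime; a small sketch over the same one-year period:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds

for uptime_target in (99.99, 99.999, 99.9999, 99.99999):
    # Allowed downtime = (1 - uptime / 100) * total seconds in the period
    budget = (1 - uptime_target / 100) * SECONDS_PER_YEAR
    print(f"{uptime_target}%: up to {budget:.2f} sec of downtime per year")
```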
But as you get to “six nines” or higher, the capabilities and configuration of monitoring tools start to play a critical role in the accuracy of uptime calculations.
Single Server Example
Let’s start with an example of Uptime calculation for a single device such as a Server.
First, we need to define what constitutes a server being up or down and what tools we are planning to use to determine its state.
Let's assume that we use classic ICMPv4 probing with a “Polling Interval” equal to 1 second.
In other words, we will be sending Ping packets from a monitoring agent to the Server every second, and if the Server does not respond, we will consider it down.
Simple enough?
Well, maybe in a perfect world, yes, but we live in the real world, and packets may get lost for reasons other than the Server being down.
Packets may get lost due to traffic shaping, CRC errors and many other causes. So, to prevent an influx of false-positive “Server down” events, we need to increase the number of consecutive packets that must be missed before we consider the Server “down” to a number greater than 1.
Let’s call this number an “Assurance Multiplier”.
Greater “Assurance Multiplier” values will result in a greater probability that a detected Server down event is a real one. At the same time, a greater “Assurance Multiplier” will result in slower detection of Server down events and an inability to detect short-lived outages lasting less than (Assurance Multiplier * Polling Interval) seconds.
We also need to introduce two new parameters, “Actual Outage Duration” and “Calculated Outage Duration”, to reflect the fact that the duration of a Server outage calculated by the monitoring agent may be slightly greater than the actual outage of the Server, because the duration of a polling interval is greater than zero.
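To make the mechanics concrete, below is a minimal Python sketch of such a detection loop. The host address is a placeholder, the probe assumes a Linux-style ping command with -c/-W flags, and the duration accounting (from the last answered probe before the outage to the first answered probe after it) is only one possible convention; real monitoring tools may differ.

```python
import subprocess
import time

HOST = "192.0.2.10"          # placeholder address of the monitored Server
POLLING_INTERVAL = 1.0       # seconds between probes
ASSURANCE_MULTIPLIER = 3     # consecutive missed probes before declaring "down"

def icmp_probe(host: str) -> bool:
    """Send one ping (Linux-style 'ping -c 1 -W 1') and report whether it was answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def monitor() -> None:
    consecutive_misses = 0
    last_success_at = time.time()  # time of the last answered probe
    down = False

    while True:
        time.sleep(POLLING_INTERVAL)
        answered = icmp_probe(HOST)
        now = time.time()

        if answered:
            if down:
                # Calculated Outage Duration: from the last answered probe before
                # the outage to the first answered probe after it. This can
                # overestimate the actual outage by up to two polling intervals.
                print(f"Server back up; calculated outage ~{now - last_success_at:.1f} s")
            down = False
            consecutive_misses = 0
            last_success_at = now
        else:
            consecutive_misses += 1
            if consecutive_misses == ASSURANCE_MULTIPLIER and not down:
                # Outage Detection Time has elapsed: the Server is now considered down.
                down = True
                print("Server considered down")
```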
Summary
“Polling Interval” – Time between two consecutive state polls.
“Assurance Multiplier” – Number of consecutive polling intervals during which the object's state must be Down to consider the monitored object truly down.
“Outage Detection Time” – Time it takes for the monitoring agent to detect an outage of the monitored object after the outage has started.
“Actual Outage Duration” – The true duration of the monitored object's outage.
“Calculated Outage Duration” – Duration of the outage as calculated by the monitoring agent.
Considering that Actual Outage Duration > (Polling Interval * Assurance Multiplier), the worst-case values should be calculated as follows:
Outage Detection Time = (Polling Interval * Assurance Multiplier) + Polling Interval
Calculated Outage Duration = (Polling Interval * Assurance Multiplier) + 2 * Polling Interval
Let's do the math with the following monitoring agent configuration example: Polling Interval = 1 second and Assurance Multiplier = 3.
We get the following results:
Outage Detection Time = 4 seconds
Shortest Detectable Actual Outage > 3 seconds
Worst-Case Calculated Outage Duration = 5 seconds
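These worst-case figures can also be computed with a couple of small helper functions (the function names are illustrative):

```python
POLLING_INTERVAL = 1.0      # seconds
ASSURANCE_MULTIPLIER = 3

def outage_detection_time(pi: float, am: int) -> float:
    # Worst case: up to one full polling interval passes before the first missed probe.
    return pi * am + pi

def worst_case_calculated_outage(pi: float, am: int) -> float:
    # Worst case: up to one polling interval of overshoot at each end of the outage.
    return pi * am + 2 * pi

print(outage_detection_time(POLLING_INTERVAL, ASSURANCE_MULTIPLIER))        # 4.0 seconds
print(worst_case_calculated_outage(POLLING_INTERVAL, ASSURANCE_MULTIPLIER)) # 5.0 seconds
```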
Now we can see that the provided monitoring agent configuration is sufficient for grading the Server's uptime as 99.9999%, but not sufficient for a 99.99999% classification.
To qualify a monitoring tool for 99.99999% accuracy, you need to decrease the Polling Interval or the Assurance Multiplier by at least 30%.
So the next time you see 99.99999% uptime reported by a monitoring tool with a polling interval of 5 minutes, you know that it is likely not true.
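One way to turn this reasoning into a quick check is to compare the worst-case Calculated Outage Duration against the annual downtime budget of the target uptime class. The criterion and function names in this sketch are an interpretation of the argument above rather than an established standard:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds

def downtime_budget(nines: int) -> float:
    """Allowed downtime per year for an uptime target of N nines (e.g. 6 -> 99.9999%)."""
    return SECONDS_PER_YEAR * 10 ** -nines

def can_verify(polling_interval: float, assurance_multiplier: int, nines: int) -> bool:
    """True if the worst-case calculated outage fits within the downtime budget."""
    worst_case = polling_interval * assurance_multiplier + 2 * polling_interval
    return worst_case <= downtime_budget(nines)

print(can_verify(1.0, 3, 6))    # True:  5 s resolution vs ~31.5 s budget
print(can_verify(1.0, 3, 7))    # False: 5 s resolution vs ~3.15 s budget
print(can_verify(300.0, 3, 7))  # False: a 5-minute polling interval cannot support "seven nines"
```

With a 5-minute polling interval and an Assurance Multiplier of 3, the worst-case calculated outage is 1,500 seconds, orders of magnitude above the roughly 3-second annual budget that “seven nines” allows.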
It gets even more interesting when we move to calculating uptime for a network rather than a single object such as a Server.