When observing network utilization (whether that’s bandwidth or some other element you monitor), you have to know your baseline. The big idea is to understand what’s normal for your network, as every network is a little different. Only when you know your network’s baseline does it become possible to detect anomalies. For example, when monitoring bandwidth, some traffic spikes are normal and some are not. Unless you know your baseline, it’s difficult to tell whether a given traffic spike is an event that you should react to. In this context, a baseline does not mean a single data point such as average utilization over a 24-hour period, or even 95th percentile utilization. Rather, the baseline is knowing what a normal pattern looks like.
I like to observe 24-hour periods. A full day reveals some things you might not expect to see, especially during off-peak hours when most network consumers are off-premises. Most of the time, it makes sense to consider Monday through Friday differently from Saturday and Sunday, since weekday traffic patterns tend to differ from weekend traffic patterns. When looking at graphs and determining whether or not a traffic spike is an anomaly, I compare the current day to the same 24-hour period from a week (or weeks) ago. In my experience, this approach can work for most enterprises. At the least, it’s not a bad place to start, especially if you’ve got nothing today.
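To make the week-over-week comparison concrete, here is a minimal Python sketch. It assumes you can export both days as equal-length lists of samples in Mbps, one value per polling interval. The function name `flag_anomalies`, the 50% tolerance, the 1 Mbps floor, and the sample values are all made up for illustration; they are not defaults from any particular NMS.

```python
def flag_anomalies(today, week_ago, tolerance=0.5, floor_mbps=1.0):
    """Return the slot indexes where today's sample exceeds the same slot
    from a week ago by more than `tolerance` (0.5 means 50% higher).

    `floor_mbps` is an assumed minimum reference value so that near-idle
    slots from last week do not turn tiny changes into "anomalies".
    """
    if len(today) != len(week_ago):
        raise ValueError("both days need the same number of samples")

    flagged = []
    for slot, (now, then) in enumerate(zip(today, week_ago)):
        reference = max(then, floor_mbps)
        if now > reference * (1 + tolerance):
            flagged.append(slot)
    return flagged


# Hypothetical 5-minute samples (Mbps) for the same half hour, one week apart.
week_ago = [12, 13, 12, 14, 13, 12]
today = [13, 12, 14, 15, 31, 24]

print(flag_anomalies(today, week_ago))  # prints [4, 5]
```

Comparing against the same weekday from a week ago keeps weekdays lined up with weekdays and weekends with weekends. A fixed percentage is only a starting point; the right tolerance depends on how noisy your own traffic is.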
One caution I’ll throw in here about comparing a current 24-hour period to an older 24-hour period is that most network monitoring systems perform a rollup of data periodically. Post rollup, older data is a summary of the more granular data that was once known. To illustrate granular data vs. rolled-up data, look at the two graphs below. One is for a day with all data intact, while the other is a week older with rolled-up data. If you look carefully, in the graph from 4-November, every vertical bar represents 5 minutes. In the graph from 28-October, every vertical bar represents 30 minutes. That’s a significant loss in detail, which means that the actual characteristics of an anomalous time period are masked.
To look even more specifically, notice that on 4-November, there’s a spike between 2:20p and 2:30p reaching roughly 20Mbps to 30Mbps. When the data for 4-November is rolled up, the spike is going to get averaged into a half-hour granularity. If we assume that traffic for that half-hour block is what’s being averaged, the post-rollup result for 2:00p – 2:30p is going to go from 6 data points of 13, 12, 14, 15, 31, and 24 Mbps to a single data point representing ~18Mbps (109 ÷ 6). That is essentially the same as the baseline of ~18Mbps registered last week, so the spike all but disappears from the rolled-up graph.
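As a quick check of the arithmetic, here is a small Python sketch of a rollup. It assumes the rollup is a plain arithmetic mean over consecutive samples, which is a simplification; your NMS may consolidate data differently. The helper name `roll_up` is mine, not something from a product. A plain mean of the six samples quoted above comes out near 18 Mbps, while the 31 Mbps peak is gone from the rolled-up point.

```python
def roll_up(samples, factor, agg="average"):
    """Collapse each run of `factor` consecutive samples into one data point.

    agg="average" mimics a simple averaging rollup (an assumption about how
    the NMS behaves); agg="max" keeps the highest sample in each bucket.
    """
    if factor <= 0:
        raise ValueError("factor must be a positive integer")
    if agg not in ("average", "max"):
        raise ValueError("agg must be 'average' or 'max'")

    rolled = []
    for start in range(0, len(samples), factor):
        bucket = samples[start:start + factor]
        if agg == "average":
            rolled.append(sum(bucket) / len(bucket))
        else:
            rolled.append(max(bucket))
    return rolled


# The six 5-minute samples (Mbps) for 2:00p to 2:30p from the text.
five_minute_mbps = [13, 12, 14, 15, 31, 24]

averaged = roll_up(five_minute_mbps, 6)
print(round(averaged[0], 1))                   # prints 18.2
print(roll_up(five_minute_mbps, 6, "max")[0])  # prints 31
```

The second call shows why it can be worth checking whether your monitoring system is able to retain a per-bucket maximum alongside the average: the maximum still shows the 31 Mbps peak that the average hides.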
Back to the original point, my encouragement is to know what your baseline is. If you aren’t gathering that data, then you simply don’t know. Most network management systems track this for you, but you should make no assumptions about how well they do it, how long they keep the data, or what the rollup behavior is. You might have some tweaking to do to get a result better than the default behavior.
By the way, if you don’t like the idea of your NMS rolling up data, you might look into the monitoring solutions from Statseeker or NETSAS (both Packet Pushers sponsors currently or in the past). I know the engineers behind those products don’t like rolled up data either.
How To Get Started
If you’ve never tackled this sort of thing before and don’t know where to start, here are a few recommendations that will cost you little or nothing.
- Enable SNMP on the network device you’d like to monitor.
- Download Observium. Observium is free and reasonably easy to use. You can get a version of Observium packaged as a Linux appliance to load on your hypervisor (and in other form factors) from TurnkeyLinux.org.
- Point Observium to your network device using the proper SNMP strings. Observium will likely discover the device interfaces and several other interesting things about it, and start monitoring for you.
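If you're curious where the bandwidth numbers come from once polling starts, SNMP-based tools generally read an interface's octet counters (for example `ifInOctets` or the 64-bit `ifHCInOctets` from IF-MIB) at a fixed interval and convert the difference into a rate. The Python sketch below shows that calculation only; it does not talk to a device, and I'm not claiming it matches Observium's internals. The function name and the sample counter values are hypothetical.

```python
def bits_per_second(prev_octets, curr_octets, interval_seconds, counter_bits=64):
    """Convert two readings of an SNMP octet counter into a rate in bits/s.

    `counter_bits` should be 32 for ifInOctets/ifOutOctets and 64 for the
    ifHC* counters. The modulo handles a single counter wrap between polls;
    it cannot detect multiple wraps or a counter reset after a reboot.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")

    modulus = 2 ** counter_bits
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_seconds


# Two hypothetical readings taken 300 seconds (5 minutes) apart.
rate = bits_per_second(1_000_000_000, 1_562_500_000, 300)
print(rate / 1_000_000)  # prints 15.0 (Mbps)
```

Each value produced this way is already an average over the polling interval, which is the same effect as a rollup on a smaller scale: a shorter polling interval shows more of the spikes.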