Yesterday I had a chat with a friend about computer networks, hardware upgrades and system monitoring and I found out that I had created in the last couple of years a very robust and detailed network monitoring and systems monitoring system, and that it has made my life a lot easier than what I could have gotten.
For example, in Nagios we monitor nearly all aspects of our FreeBSD based servers: Not only the standard memory, CPU and diskspace, but also the answer from the DNS server on it, the presence of the crond, snmpd, inetd, sshd and syslogd. Not only do we monitor if all required processes are running, but also if their PID files are there and if the processes in these PID files do exist. And we monitor the status of the RAID cards, the status of the ethernet cards and were the default gateway points to. And the uptime of the server and the offset of the NTP synced time of the server.
With regarding to network devices (routers, switches) we monitor the uptime of the device (these things reboot faster than Nagios can detect), we monitor the status of all ports (duplex, speed, operational status), temperature and status of the power supplies. And the status of the OSPF neighbours and BGP neighbours, plus a list of expected networks in the routing table.
Network link devices (antennas, fibre convertors, laser heads) which support some form of remote management are checked the same: ethernet link status, radio link status, uptime. Anything which will display possible problems with it.
For our PABX's we monitor the status of the PRIs, the status of the IAX and SIP destinations.
Call it overdone, call it wasted too much time on monitoring... But when I replace a server or a device on the network, I would like to know without too much hassle if everything is back in order once I turn it on without having to go through too much hassle: When my monitoring program says everything is fine, I know everything went fine.
Recently I had a discussion with a collegue about the monitoring of our systems and network devices. I showed him what we all measure, and he wondered if it was overkill or not. I told him that somethings maybe were, but that good monitoring is the first step to knowing what happens in your network, and that knowing what happens in your network is the first step to be able to isolate problems when they arise.
So the question is: "What do you need to monitor?" The answer is easy: "Everything". That's a pretty big amount. Let reality kick in and rephrase it to: "Everything you can think off". This is too vague and misses the final goal: "Everything you assume to be the normal conditions for a system or service to run properly.".
"Everything you assume to be the normal conditions for a system or service run properly". That is quite a lot, and different for each service you provide or device you have on the network. It will make you to have to you dig into your network and servers to find out what is going on.
The rest of this story can be found at my website.