Yesterday I had a chat with a friend about computer networks, hardware upgrades and system monitoring, and I realised that over the last couple of years I have built a very robust and detailed network and systems monitoring setup, and that it has made my life a lot easier than it would otherwise have been.
For example, in Nagios we monitor nearly all aspects of our FreeBSD-based servers: not only the standard memory, CPU and disk space, but also the answer from the DNS server running on it and the presence of crond, snmpd, inetd, sshd and syslogd. We check not only that all required processes are running, but also that their PID files are there and that the processes named in those PID files actually exist. We also monitor the status of the RAID cards, the status of the ethernet cards and where the default gateway points, as well as the uptime of the server and the offset of its NTP-synced clock.
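The PID file check described above could look roughly like the following sketch. It is not the plugin we actually run: the function names and the script name check_pidfile.py are made up for illustration, and it assumes a Unix-like system. The only parts that are standard are the Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the use of signal 0 to test whether a process exists.

```python
import os

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_pid_file(path):
    """Return (code, message) for the PID file at `path` (hypothetical helper)."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return CRITICAL, f"CRITICAL: PID file {path} is missing"
    except ValueError:
        return CRITICAL, f"CRITICAL: PID file {path} does not contain a PID"
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot read {path}: {exc}"

    if pid <= 0:
        # Never pass 0 or a negative number to kill(): that would address
        # a whole process group instead of a single process.
        return CRITICAL, f"CRITICAL: PID file {path} contains invalid PID {pid}"

    try:
        # Signal 0 delivers nothing; it only checks that the process exists.
        os.kill(pid, 0)
    except ProcessLookupError:
        return CRITICAL, f"CRITICAL: no process with PID {pid} (from {path})"
    except PermissionError:
        # The process exists but belongs to another user, which is fine here.
        pass
    return OK, f"OK: process {pid} from {path} is running"


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    if len(argv) != 2:
        print("UNKNOWN: usage: check_pidfile.py /path/to/file.pid")
        return UNKNOWN
    code, message = check_pid_file(argv[1])
    print(message)
    return code
```

To use it as a plugin, the script would end with sys.exit(main(sys.argv)) so that Nagios sees the return code. Note that this only proves that some process with that PID exists, not that it is the right daemon, so it complements the process check rather than replacing it.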
With regard to network devices (routers, switches), we monitor the uptime of the device (these things can reboot faster than Nagios can detect an outage), the status of all ports (duplex, speed, operational status), the temperature and the status of the power supplies. We also monitor the status of the OSPF and BGP neighbours, plus a list of networks we expect to see in the routing table.
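The reason for watching the uptime is that a device which reboots between two checks never looks "down", but its uptime counter gives it away. A minimal sketch of that comparison, assuming the previous uptime is stored somewhere between checks and the current one is read over SNMP (sysUpTime, which counts in hundredths of a second); the function names and the 600 second threshold are illustrative assumptions:

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def timeticks_to_seconds(ticks):
    """Convert SNMP TimeTicks (hundredths of a second) to whole seconds."""
    return ticks // 100


def detect_reboot(previous_uptime, current_uptime, min_uptime=600):
    """Compare two uptime readings in seconds and return (code, message).

    `previous_uptime` is None when no earlier reading is available.
    `min_uptime` is an arbitrary example threshold, not a recommendation.
    """
    if previous_uptime is not None and current_uptime < previous_uptime:
        # The counter went backwards, so the device restarted in between.
        # Caveat: a 32-bit TimeTicks counter wraps after roughly 497 days,
        # and such a wrap would look like a reboot here as well.
        return CRITICAL, (
            f"CRITICAL: device rebooted (uptime {current_uptime}s, "
            f"was {previous_uptime}s)"
        )
    if current_uptime < min_uptime:
        return WARNING, f"WARNING: device up for only {current_uptime}s"
    return OK, f"OK: device up for {current_uptime}s"
```

For example, detect_reboot(5000, 120) reports a reboot even if every ping check in between succeeded, while detect_reboot(None, 120) falls back to the "recently started" warning because there is nothing to compare against.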
Network link devices (antennas, fibre converters, laser heads) that support some form of remote management are checked the same way: ethernet link status, radio link status, uptime. Anything that could reveal a possible problem with them.
For our PABXs we monitor the status of the PRIs and of the IAX and SIP destinations.
Call it overdone, call it too much time wasted on monitoring... But when I replace a server or a device on the network, I would like to know without too much hassle whether everything is back in order once I turn it on: when my monitoring program says everything is fine, I know everything went fine.