A computer network can suffer many kinds of problems, from an unavailable physical server to a network failure, and the consequences are generally serious. These failures can block an entire network, cripple a company's working tools, or prevent customers from reaching its website. Beyond the lost revenue, they also call the company's professionalism into question. A system or network engineer must therefore do everything possible to improve the reliability of the IT system.

The needs

Prevent outages

Failures need to be prevented. Of course, a careful study of the system design avoids most problems but, as we all know, it is impossible to build an infallible computer system. From hardware failures to software failures, the events that cause a breakdown on the network often occur unpredictably.

One solution is to analyze our facilities continuously and act as quickly as possible when the critical conditions that lead to failure are reached. But without an automated solution, collecting all the network information would require considerable work.

Automation of monitoring

To benefit from permanent and effective monitoring of all our networks, we need to entrust the task to monitoring software, which will run at regular intervals and retrieve a complete diagnostic of our IT infrastructure. This provides ongoing monitoring of every installed server, allowing us to act when a problem arises.

Immediate alert

Monitoring software lets us raise alarms either on a predefined schedule or as soon as an event occurs. We can ask the software to alert us when a server reaches critical operating conditions, or when a service suddenly becomes unavailable (think of your company website, for example). Alerts can be delivered by phone, email or SMS depending on the urgency and level of criticality. We are thus kept permanently informed of network events and can act immediately when problems arise.

Archived network events

Using a monitoring utility will also allow us to keep an archive of all system data. These archives will be very instructive about the behavior of the server or active network element, giving us the possibility of knowing all the events that led to the crash of the service. This will allow us to better understand the failure and restore the service much faster.

Automatic actions

Many failures are easy to repair, and the repair can be scripted. With suitable monitoring software, we can launch specific scripts when a failure occurs. This lets us restore a service almost instantly, and also automate some maintenance operations, such as freeing disk space when a disk fills up.
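As a rough illustration of this idea, here is a minimal Python sketch that runs a repair command only when a health check fails. The function name `remediate`, the `check` callable and the repair command are all hypothetical; real monitoring tools have their own mechanisms for attaching scripts to failures.

```python
import subprocess
from typing import Callable, Optional, Sequence


def remediate(check: Callable[[], bool],
              command: Sequence[str],
              timeout: float = 30.0) -> Optional[int]:
    """Run `command` only when `check` reports a failure.

    Returns None when the check passed (nothing was run), otherwise the
    exit status of the repair command (0 conventionally means success).
    """
    if check():
        return None  # service is healthy, no action needed
    # The check failed: launch the repair script and wait for it to finish.
    result = subprocess.run(list(command), timeout=timeout)
    return result.returncode
```

The check and the repair command are passed in as parameters, so the same function could wrap a service restart, a log purge or any other scripted repair.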

What to monitor

Because networks consist of a very large number of elements, it can be difficult to identify the most important equipment.

Services

Since the role of a network is usually to provide access to services hosted on remote machines, it goes without saying that the machines hosting these services must be constantly monitored. Because users access these services directly, their unavailability would quickly penalize the entire network and discredit the company. We must therefore use the monitoring software to watch these services continuously.

Active equipment

Monitoring services is not enough: what is the point of monitoring services if users cannot reach them because of a router failure? Routers, switches and other intermediate equipment must also be monitored, to ensure that access to the services is operational.

How to monitor?

Monitoring according to the different OSI layers

The OSI model gives us several levels at which to probe a network, mainly layers 3, 4 and 7. Used properly, they let us gather essential information about the state of the network and its services:

Layer 3 gives us ICMP echo (ping, the most basic tool), which indicates whether a machine is connected to the network and responding. It is usually the first test run during a check, since it shows whether the target device can be reached at all.
Layer 4 lets us test individual ports on a server with a TCP or UDP connection. This tells us whether a service is present on a machine, but not whether that service is fully functional.
Layer 5 provides NetBIOS, which can collect statistics but is available on Windows only, which greatly limits its use.
SNMP (Simple Network Management Protocol) is a dedicated network management protocol. It is an application-layer (layer 7) protocol, normally carried over UDP. Supported by many routers, it lets us monitor intermediate equipment effectively.
Layer 7 also gives us a complete test of a server's services: we can send the service a set of real requests to check that it is fully available.

Layer 7 also lets the monitored items send their information to the monitoring software themselves: SNMP devices can push notifications (traps), and a monitoring agent can be installed on the monitored machine.
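The layer 4 and layer 7 checks described above can be sketched in a few lines of Python using only the standard library. This is a simplified illustration, not how any particular monitoring tool implements its checks; the function names `tcp_port_open` and `http_ok` are made up for the example. The layer 3 (ICMP) check is left out because sending raw ICMP packets usually requires elevated privileges.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Layer 4 check: can a TCP connection be established to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Layer 7 check: does the web service answer a real request?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # HTTP error statuses, network errors and malformed URLs all
        # count as "service not available" in this sketch.
        return False
```

An open port only shows that something is listening, whereas a successful HTTP request shows that the service actually answers, which is the distinction made between layers 4 and 7 above.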

Monitoring of other factors

There are also purely "system" characteristics to monitor, which help us prevent breakdowns and assess whether the equipment can handle the demands placed on it. CPU and memory load give us an interesting overview of the server load, and could allow control systems to allocate an equal share of resources to each service. Other factors, such as disk saturation, must also be considered because they can cause a system failure.
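To give an idea of what collecting such system characteristics looks like, here is a small Python sketch that reads the CPU load average and the disk usage with the standard library. It assumes a Unix-like system (`os.getloadavg` is not available on Windows), and the function name and dictionary keys are arbitrary choices for the example. Memory usage is omitted because the standard library has no portable call for it.

```python
import os
import shutil


def collect_metrics(path: str = "/") -> dict:
    """Return a snapshot of basic system metrics (Unix-like systems only)."""
    load_1min, _, _ = os.getloadavg()   # 1, 5 and 15 minute load averages
    cpus = os.cpu_count() or 1          # cpu_count() may return None
    disk = shutil.disk_usage(path)      # total, used and free bytes
    used_percent = 100.0 * disk.used / disk.total if disk.total else 0.0
    return {
        # A sustained load per CPU above 1.0 suggests the CPUs are saturated.
        "load_per_cpu": load_1min / cpus,
        "disk_used_percent": used_percent,
    }
```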

Two network monitoring tools are commonly used: Nagios and Zabbix. Both have been licensed under the GPL, which allows their free use and full access to source code.

ZABBIX

Zabbix is a complete monitoring solution that monitors a network at low cost. It can work either by testing machines directly or through an agent installed on each machine, which periodically sends monitoring information. Zabbix can send alerts by email or text message. It can be downloaded free of charge at http://www.zabbix.com.

Service monitoring

Zabbix primarily uses SNMP (Simple Network Management Protocol) to collect its information. The program can monitor any application that supports SNMP (Oracle, WebSphere, WebLogic, Exchange, etc.), or a server through an agent installed on the target machine.

Server monitoring

The Zabbix server runs on AIX, FreeBSD, HP-UX, Linux, Mac OS X, OpenBSD, SCO Open Server, Solaris and Tru64/OSF.
The Zabbix agent runs on the same platforms as the server, and also on Windows NT 4.0, 2000, 2003 and XP and on Novell NetWare. The availability of certain services (SMTP, IMAP, POP3, HTTP, SSHD, etc.) can be monitored without installing any agent.
Any platform or network component that supports SNMP (routers, switches, hubs, printers, etc.) can be monitored by Zabbix.

Network monitoring

Zabbix produces traffic statistics for a network, along with a map that gives a good representation of the network infrastructure. This map shows the components, their connections and status, and the reasons why a component is unavailable or has a problem. An example of a map generated by Zabbix:

[Figure: Configuration of maps]

Performance Monitoring

One of the most important uses of Zabbix is performance monitoring. CPU load, the number of processes (total and active), disk activity and available memory are among the system parameters Zabbix can monitor.
Zabbix can produce graphs that help the administrator identify bottlenecks in the network. Examples of graphs:

[Figure: Screen of custom graphs]

Alert usage

An administrator can define any condition for triggering an alert, using flexible expressions. When such an expression becomes true (or false), an alert is sent to the email addresses the administrator has configured.
An external program can be used to forward alerts, for example by SMS.
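The principle of alerting when an expression changes value can be illustrated with a short Python sketch. This is not Zabbix's expression syntax or its implementation: the `Trigger` class, the metric names and the message format are assumptions made for the example. The point is that an alert is produced only when the condition switches state, not on every measurement.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Trigger:
    """A named condition evaluated against the latest metrics."""
    name: str
    condition: Callable[[dict], bool]
    active: bool = False  # True while the problem condition holds

    def evaluate(self, metrics: dict) -> Optional[str]:
        """Return an alert message on a state change, otherwise None."""
        now_active = self.condition(metrics)
        if now_active == self.active:
            return None  # no change since the last evaluation, stay quiet
        self.active = now_active
        state = "PROBLEM" if now_active else "OK"
        return f"{state}: {self.name}"


# Example: alert when the load per CPU goes above 1.0, and again on recovery.
high_load = Trigger("high load", lambda m: m["load_per_cpu"] > 1.0)
```

The message returned by `evaluate` could then be handed to whatever delivery channel is configured, such as email or an external SMS program.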

Integrity checking

Zabbix can monitor critical system files such as configuration files, binaries, the kernel and scripts. An alert is triggered when one of these files changes, indicating that system integrity is at risk.
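One common way to detect that a file has changed is to compare a cryptographic digest of its contents against a stored baseline. The Python sketch below shows the general technique using SHA-256. It is only an illustration: the function names are hypothetical, and nothing here describes how Zabbix itself performs the check.

```python
import hashlib
from typing import Dict, Iterable, List


def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths: Iterable[str]) -> Dict[str, str]:
    """Record the current digest of every monitored file."""
    return {path: file_digest(path) for path in paths}


def changed_files(baseline: Dict[str, str], paths: Iterable[str]) -> List[str]:
    """List the files whose digest differs from the baseline."""
    current = snapshot(paths)
    return sorted(p for p, d in current.items() if baseline.get(p) != d)
```

A baseline would be taken once with `snapshot` while the system is in a known-good state, then `changed_files` run periodically to list any monitored file that has been modified since.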

Although less common than Nagios, Zabbix offers almost exactly the same features. Zabbix is simple to configure because configuration is done through a web interface. In addition, its agent system can monitor system properties on Windows systems.

Best new 2015 alternative

LibreNMS

LibreNMS is an autodiscovering PHP/MySQL-based network monitoring system forked from Observium. LibreNMS aims to be easy to use, painless to deploy, and to support monitoring of a wide range of devices. I will try it as soon as I have some spare time.

Other well-known open source applications include Cacti, Shinken and OpenNMS.

Conclusion

These monitoring programs are a very effective way to guard against the problems a network may face. By avoiding breakdowns as much as possible, these tools can significantly improve the availability of a system.
They also make it possible to study how the network evolves over the long term, to better assess its capacity and determine whether an upgrade is necessary. These tools do not replace the network administrator, but they greatly facilitate the job. And if you have a system administrator who does not use this kind of tool, I'd suggest replacing them.
