Monitoring

Monitoring is the art or science of keeping tabs on various systems and services, and of alerting or carrying out various actions should something be awry. This overlaps to some degree with init services (e.g. runit), which may have some of the "check on a service and restart or warn about it" behavior that monitoring does, and with configuration management software (e.g. Ansible), which can also check for running processes and run code as necessary to correct some condition.

Monitoring services typically have a lot more checks (breadth) and can check in more detail (depth) how a service is behaving: not just whether a Gemini server is up, but how much CPU the service is using, whether a particular URL contains particular content, and whether the contents of the URL can be accessed in a certain amount of time (service latency). Monitoring software may also have highly configurable alerts, escalations, and other such logic, though some of that may overlap with what a trouble ticket system provides. Monitoring may also be closely tied to local documentation; an alert could include a gemini link to a page on how to troubleshoot that particular issue.
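A depth check of this sort might be sketched as follows. This is only an illustration, not how any particular monitoring package does it: the names check_url and CheckResult are made up here, and the fetch function is an assumption left to the caller (a Gemini or HTTP client of your choosing) so that the logic can be tried without a network.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    latency: float  # seconds the fetch took
    reason: str


def check_url(fetch: Callable[[str], str], url: str, expect: str,
              max_latency: float,
              clock: Callable[[], float] = time.monotonic) -> CheckResult:
    """Fetch url, then verify both its content and how long the fetch took.

    fetch is a hypothetical caller-supplied function that returns the
    body of the URL as text, or raises an exception on failure.
    """
    start = clock()
    try:
        body = fetch(url)
    except Exception as err:  # any failure counts as "down"
        return CheckResult(False, clock() - start, f"fetch failed: {err}")
    latency = clock() - start
    if expect not in body:
        return CheckResult(False, latency, "expected content missing")
    if latency > max_latency:
        return CheckResult(False, latency, "too slow")
    return CheckResult(True, latency, "ok")
```

A service that answers, but with the wrong content or too slowly, fails the check just as a dead one would; the reason field says which, and could be placed into the alert text.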

The complexity of the monitoring will depend on the complexity of the site; a small virt running in the cloud may not need any monitoring if one does not care whether the site is down. In that case one might rely on friends to say, "hey, your site is down", or on a human monitoring service. Some tech folks may need to socialize more. Larger sites, especially those where actual money is on the line or where contracts demand various levels of service availability, really should have better monitoring and alerting in place.

Monitoring software can be very complicated and difficult to set up (e.g. Nagios), at least the first time, so there can be a high up-front cost for initial deployments, and then you may be stuck with that monitoring service until there's enough free time available (tuits) to implement something else.

Monitoring can also suffer from the "boy who cried wolf" problem: if too many alerts are generated, the IT staff may take to ignoring the metaphorical boy warning about dangers that are usually, but not always, a non-issue. Therefore monitoring may need a means to limit the alarms (take care in setting up what you alert for), or the alarms could be fed into yet another service that rate limits and throttles them (more complicated). Again there is some overlap with a trouble ticket system here, or larger sites may need a system of escalation if the alerts are falling on deaf ears (or maybe the on-call is so busy fighting some fire that they cannot deal with the latest alarms). Management may not be amused if you do not respond to their messages in time, but when there's flooding in a server room, or a UPS on fire…
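One simple way to limit the alarms is to allow a single alert per problem within some quiet period and to count the rest as suppressed. The following is a minimal sketch of that idea; the AlertThrottle class and the alert key names are hypothetical, and real monitoring or ticketing software will have its own, likely more elaborate, mechanism.

```python
import time
from typing import Callable, Dict


class AlertThrottle:
    """Allow one alert per key in each quiet period; suppress the rest."""

    def __init__(self, quiet_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.quiet_seconds = quiet_seconds
        self.clock = clock
        self.last_sent: Dict[str, float] = {}   # key -> time of last alert
        self.suppressed: Dict[str, int] = {}    # key -> count since last alert

    def should_send(self, key: str) -> bool:
        now = self.clock()
        last = self.last_sent.get(key)
        if last is not None and now - last < self.quiet_seconds:
            # Still inside the quiet period: count it, do not send it.
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return False
        self.last_sent[key] = now
        self.suppressed[key] = 0
        return True
```

The caller would ask should_send("disk-full") before paging anyone. The suppressed count could be mentioned in the next alert that does go out, so the on-call knows the boy has been shouting for a while.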

Monitoring can be done manually, e.g. by logging in to a server and keeping an eye on things with top(1) or systat(1), but this does not scale very well, does not typically keep a historical record over time, and does not automate the checks or any recovery code. Hence the use of monitoring software.

Monit - utility for monitoring services on a Unix system

Custom monitoring can also be written; for example, you might want to do deep checks on a service for latency that is not captured (or not captured well enough) by existing monitoring services. This could be a standalone system, stood up for a particular need, or integrated into a more complicated monitoring system that will graph your custom data over time.
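A standalone custom latency check could be as small as the following sketch, which times some action, keeps the samples, and renders them as CSV for some other tool to graph. The LatencyLog class and the CSV column names are invented for illustration, and summary() assumes at least one sample has been recorded.

```python
import csv
import io
import statistics
import time
from typing import Callable, Dict, List, Tuple


class LatencyLog:
    """In-memory history of (timestamp, latency) samples for one check."""

    def __init__(self) -> None:
        self.samples: List[Tuple[float, float]] = []

    def add(self, when: float, latency: float) -> None:
        self.samples.append((when, latency))

    def measure(self, action: Callable[[], object]) -> float:
        """Time one call to action and store the result."""
        start = time.perf_counter()
        action()
        latency = time.perf_counter() - start
        self.add(time.time(), latency)
        return latency

    def summary(self) -> Dict[str, float]:
        """Summarize the history; raises if there are no samples yet."""
        values = [latency for _, latency in self.samples]
        return {
            "count": len(values),
            "median": statistics.median(values),
            "worst": max(values),
        }

    def to_csv(self) -> str:
        """Render the history as CSV text for a graphing tool to read."""
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(["timestamp", "latency_seconds"])
        for when, latency in self.samples:
            writer.writerow([f"{when:.0f}", f"{latency:.3f}"])
        return out.getvalue()
```

Something like cron(8) could run the measurement every minute and append the output to a file. Keeping the median and the worst case apart matters, as a service can be fine on average and still have terrible moments.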

Monitoring need not apply only to computer systems; satellite images stored over time can provide before and after snapshots from which a better understanding of Earth systems, and possibly better monitoring and alarms for future events, may be derived:

"The Himalayas buried this town without warning...what actually happened? | Dharali disaster". TheGeoModels. 2025.

Documentation is also important, both for the overall view or goal of the system, and for what to do when particular alarms fire. The on-call may not be operating under the best of conditions (stress, lack of sleep, etc.) so links with simple checklists to follow and next steps for escalations may make their life less bad.

Monitoring is not without costs; it takes energy to maintain the monitoring system, and there may be limited "slots" available or service charges that prevent certain forms of monitoring. Despite protests by the IT staff, monitoring cannot take up too much of the budget, just as Music cannot consume too much money, protests by one Herr Bach aside. If there are limits, application code may need to be modified to carry out various monitoring tasks that a monitoring system would normally do: health checks, latency measurements, alert generation. A monitoring system that is too expensive or too complicated to maintain may suffer from a "loss of theory" as new hires cannot figure it out and leave it to rot. So these things go.