Computers are finicky. As stable and reliable as we would like to believe they have become, the average server can cease to function for hundreds of different reasons. Some of the common problems that cause websites or services to crash can’t really be avoided. If you suddenly find your site suffering from a DDOS attack or a hardware failure, all you can do is react to the situation.
But there are many simple things that are totally preventable that can be addressed proactively to ensure optimal uptime. To keep an eye on the more preventable issues, setting up monitoring for your entire stack (both the server as well as the individual applications) is helpful. At Zivtech, we use a tool called Sensu to monitor potential issues on everything we host and run.
Sensu is a Ruby project that operates by running small scripts to determine the health of a particular application or server metric. The core project contains a number of such scripts called “checks.” It’s also very easy to write custom checks and they can be written in any language, thus allowing developers to easily monitor new services or applications. Sensu can also be run via a client server model and issue alerts to members of the team when things aren’t behaving properly.
Server checks
As a general place to start, you should set up basic health checks for the server itself. The following list gives you a good set of metrics to keep an eye on and why it is in your best interest to do so.
RAM
What to check
Monitor the RAM usage of the server versus the total amount of RAM on the server.
Potential problem monitored
Running out of RAM indicates that the server is under severe load and application performance will almost certainly be noticeable to end users.
Actions to take
Running low on RAM may not be a problem if it happens once or twice for a short time. Sometimes there are tasks that require more resources and this may not cause problems, but if the RAM is perpetually running at maximum capacity, then your server is probably going to be moving data to swap space (see swap usage below) which is much slower than RAM.
Running near the limits of RAM constantly is also a sign that crashes are eminent since a spike in traffic or usage is surely going to require allocating resources that the server simply doesn’t have. Additionally, seeing spikes in RAM usage may indicate that a rogue process or poorly optimized code is running, which helps developers address problems before your users become aware of them.
Linux swap usage
What to check
Check swap usage as a percentage of the total swap space available on a given server.
Potential problem monitored
When the amount of available RAM is running short or the RAM is totally maxed out, Linux moves data from RAM to the hard drive (usually in a dedicated partition). This hard drive space is known as swap space.
Generally, you don’t want to see too much swap space being used because it means that the available RAM isn’t enough to handle all the tasks the server needs to perform. If the swap space is filled up completely, then it means that RAM is totally allocated and there isn’t even a place on disk to dump extra data that the system needs. When this happens, the system is probably close to a crash and some services are probably unresponsive. It can also be very hard to even connect to a server that is out of swap space as all memory is being used completely at this point and new tasks must wait to run.
Actions to take
If swap is continually running at near 100% allocation, it probably means the system needs more RAM, and you’d want to increase the swap storage space as part of this maintenance. Keeping an eye on this will help ensure you aren’t under-allocating resources for certain machines or tasks.
Disk space
What to check
Track current disk space used versus the total disk space on the server’s hard drives, as well as the total inodes available on the drives.
Potential problem monitored
Running out of disk space is a quick way to kill an application. Unless you have painstakingly designed your partitions to prevent such problems (and even then you may not be totally safe), when a disk fills up some things will cease working.
Many applications write files to disk and use the drive to store temporary data. Backup tasks rely on disk space as do logs. Many tasks will cease functioning properly when a drive or partition is full. On a website running Drupal, a full drive will prevent file uploads and can even cause CSS and JavaScript to stop working properly as well as data not being persisted to the database.
Actions to take
If a server is running low on space, it is relatively easy to add more. Cloud hosting providers usually allow you to attach large storage services to your running instance and if you use traditional hardware, drives are easy to upgrade.
You might also discover that you’ve been storing data you don’t need or forgot to rotate some logs which are now filling up the drive. More often than not, if a server is running out of space, it is not due to the application actually requiring that space but an error or rogue backup job that can be easily rectified.
CPU
What to check
Track the CPU usage across all cores on the server.
Potential problem monitored
If the CPU usage goes to 100% for all cores, then your server is thinking too hard about something. Usually when this happens for an extended period of time, your end users will notice poor performance and response times. Sites hosted on the server might become unresponsive or extremely slow.
Action to take
In some cases, an over-allocated CPU may be caused by a runaway processes but if your application does a lot of heavy data manipulation or cryptography, it might be an indication that you need more processing power.
When sites are being scraped by search providers or attacked by bots in some coordinated way, you might also see associated CPU spikes. So this metric can tip you off to a host of issues including the early stages of a DDoS attack on the server. You can respond by quickly restarting processes that are misbehaving and blocking potentially harmful IPs, or identify other performance bottlenecks.
Reboot required
What to check
Linux servers often provide an indication that they should be rebooted, usually related to security upgrades.
Potential problem monitored
Often after updating software, a server requires a reboot to ensure critical services are reloaded. Until this is done, the security updates are often not in full effect.
Action to take
Knowing that a server requires a reboot allows your team to schedule downtime and reduce problems for your end users.
Drupal specific checks
Zivtech runs many Drupal websites. Over time we have identified some metrics to help us ensure that we are always in the best state for security, performance, search indexes, and content caching. Like most Drupal developers, we rely on drush to help us keep our sites running. We have taken this further and integrated drush commands with our Sensu checks to provide Drupal specific monitoring.
Drupal cron
What to check
Drupal’s cron function is essential to the health of a site. It provides cleanup functions, starts long running tasks, processes data, and many other processes. Untold numbers of Drupal’s contributed modules rely on cron to be running as well.
Potential problem monitored
When a single cron job fails, it may not be a huge problem. But the longer a site exists without a successful cron run, the more problems you are likely to encounter. Services start failing without cron. Garbage cleanup, email notifications, content updates, and indexing of search content all need cron runs to complete reliably.
Action to take
When a cron job fails, you’ll want to find out if it is caused by bad data, a poorly developed module, permissions issues, or some other issue. Having a notification about these problems will ensure you can take proactive measures to keep your site running smoothly.
Drupal security
What to check
Drupal has an excellent security team and security processes in place. Drush can be used to get a list of modules or themes that require updates for security reasons. Generally, you want to deploy updates as soon as you find out about them.
Potential problem monitored
By the time you’ve been hacked, it’s too late for preventative maintenance. You need to take the security of your site’s core and contributed modules seriously. Drupal can alert site administrators via email about security alerts, but moving these checks into an overarching alerting system with company wide guidelines about actions to take and a playbook for how to handle them will result in shorter times that a given site is vulnerable.
Action to take
Test your updates and deploy them as quickly as you can.
Be Proactive
It’s difficult to estimate the value of knowing about problems before they occur. As your organization grows, it becomes more and more disruptive to be dealing with emergency issues on servers. The stress of getting a site back online is exponentially more than the stress of planned downtime.
With a little bit of effort, you can detect issues before they become problems and allocate time to address these without risking the deadlines for your other projects. You may not be able to avoid every crash, but monitoring will enable you to tackle certain issues before they disrupt your day and will help you keep your clients happy.