Instances with Failed Status Checks
What it does: matches instances that fail the system/reachability status checks
This bot identifies compute instances that fail the instance or system reachability checks. A failure means that the system cannot be reached over the network and is likely running on failed hardware. We strongly encourage you to migrate these failed systems to new hardware.
Why do I care?
Monitoring the lifecycle state and status checks of your cloud instances ensures that your systems are running properly and that you have access to the compute capacity you are paying for. Instances that fail status checks can cause downtime for your organization and waste money.
The Failed Instances Bot checks your systems every 10 minutes and automatically migrates your data from failed or failing hardware in AWS.
Why do failed instances occur?
Within Amazon there are two measures of availability: lifecycle state and status checks. The lifecycle state defines whether an instance is running, stopped, or terminated. Status checks determine whether the virtual instance that your application or data runs on is working properly. Amazon sends periodic heartbeats to the underlying hardware at the process, hypervisor, and network layers to evaluate both. If an instance fails any one of those checks, it is unreachable and the data on it is unusable.
Status checks are important because many people monitor only the lifecycle state of their instances and do not monitor network accessibility. Most monitoring detects lifecycle changes but not failed status checks, which is why turning on the DivvyCloud “Instances with Failed Status Checks” Bot is so useful. When a status check fails, organizations typically either migrate the system to a different host or conduct deeper inspection and remediation.
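The difference between lifecycle state and status checks can be illustrated with a minimal Python sketch. It is not the Bot's actual implementation. It assumes a response dictionary shaped like the one returned by the EC2 `describe_instance_status` API call, and the function name `failed_status_checks` and the sample instance IDs are hypothetical. Treating only the `impaired` status as a failure is also an assumption of this sketch.

```python
from typing import Any, Dict, List


def failed_status_checks(response: Dict[str, Any]) -> List[str]:
    """Return IDs of running instances whose system or instance check is impaired."""
    failed = []
    for entry in response.get("InstanceStatuses", []):
        # Lifecycle state: skip instances that are not running,
        # since stopped or terminated instances are not expected to pass checks.
        if entry["InstanceState"]["Name"] != "running":
            continue
        # Status checks: the system check covers the underlying host,
        # the instance check covers the individual virtual machine.
        system = entry["SystemStatus"]["Status"]
        instance = entry["InstanceStatus"]["Status"]
        # Assumption: only "impaired" counts as a failed check here.
        if "impaired" in (system, instance):
            failed.append(entry["InstanceId"])
    return failed


# Hypothetical sample data in the shape of a describe_instance_status response.
sample = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-0aaa0000000000001",
            "InstanceState": {"Name": "running"},
            "SystemStatus": {"Status": "ok"},
            "InstanceStatus": {"Status": "ok"},
        },
        {
            "InstanceId": "i-0bbb0000000000002",
            "InstanceState": {"Name": "running"},
            "SystemStatus": {"Status": "impaired"},
            "InstanceStatus": {"Status": "ok"},
        },
        {
            "InstanceId": "i-0ccc0000000000003",
            "InstanceState": {"Name": "stopped"},
            "SystemStatus": {"Status": "not-applicable"},
            "InstanceStatus": {"Status": "not-applicable"},
        },
    ]
}

print(failed_status_checks(sample))
```

The second instance reports a lifecycle state of running yet fails its system status check, which is the case that lifecycle-only monitoring misses. With the boto3 library, a real response could be obtained from `boto3.client("ec2").describe_instance_status(IncludeAllInstances=True)`, assuming AWS credentials are configured.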
It’s not uncommon for hardware to fail
Hardware can fail for a number of reasons. Equipment can get old, moving parts can break, overheating can occur, and hard disks can fail. These failures are not uncommon at scale, and knowing this can help your organization better prepare for and react to them. If you don’t have automation in place to detect system failures, it can be hard to find the affected systems. Cloud providers may email you, but often not fast enough for applications with high uptime requirements. It could be an hour or two before you are alerted to the problem. Our Bot will identify failed instances within 10 minutes at most, using BotFactory’s continuous API-based data harvesting. This is much closer to real time than waiting for the AWS monitoring system to catch the failure.