Server Outage Post-Mortem
On April 7 and 8, two of our staging servers suffered a file system failure. We were able to restore one server within a few hours, but the other took the better part of a day to restore. We want our customers to be able to rely on our servers, so here is a post-mortem of what happened and what we are doing to make sure it doesn’t happen again.
This is the first time in WP Stagecoach’s history that we have had a complete server failure. We are doing everything we can to make sure it is the last!
We back up our servers every night, around 2am Pacific Time. On the morning of April 7, one server crashed during the backup and its boot loader was corrupted. The same thing happened (after the backup) to a second server on April 8. As a result, neither server could boot back up. Our investigation of the crash showed that the problem came from a conflict between our cluster software and the boot filesystem.
How We Fixed It
In both cases, we had to restore from backups. The first server needed a full restore, whereas the second (smaller) server only needed a partial restore. The first restore took several hours, which unfortunately meant that the server was down for most of the day. The second server was down for around two hours.
What We Are Doing To Make Sure It Doesn’t Happen Again
We want our servers to always be up for our customers, so we are taking steps to prevent a repeat. We have begun migrating all our production servers to new physical hardware with a different boot filesystem, and we are continuing to work with our vendor to make sure the underlying problem is fixed.
What We Learned
The biggest thing we learned from this is that we need a procedure in place for communicating with customers when something like this happens. We were so busy reacting to the situation that we did not take the time to set up a good channel of communication with our customers. Since then, we have set up a better communication procedure so that our customers aren’t left in the dark. Let us hope that we never have to use it.