In any company, a member of the leadership should be able to randomly ask a senior member of the tech staff to run a failure simulation. In this simulation, the tech team works to restore a system (in this case, a website) from its backups as if it had catastrophically failed for an unknown reason (or been compromised by a really talented outside entity). Pro tip: this can be done on any clean server. Just deploy a new instance and go.
Most modern offices regularly run fire drills, which may or may not be mandated by the local fire department or city requirements, and we should be doing the same with our cybersecurity efforts. “Stop, drop, and restore from backups” should be the tech team’s motto. These restoration drills should also be timed, so you can actually predict how long a restore would take in practice.
I’ve worked at a few companies with high-visibility websites where things went haywire for one reason or another, and restoring is always an incredibly stressful event, unless you do it often. You should absolutely have three things:
- An up-to-date backup of your entire web platform
- A checklist for restoring from the backup (written as explicitly as possible)
- A routine for practicing the process of restoring from the backup
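The checklist and the practice routine can live in the same script. Here is a minimal sketch, assuming each checklist item can be wrapped in a Python function; `run_drill` and `total_seconds` are names I made up for illustration, and the real steps (provisioning a server, loading the database dump, and so on) are yours to fill in.

```python
import time
from typing import Callable, List, Tuple

# A checklist step is a human-readable name plus a function that performs it.
Step = Tuple[str, Callable[[], None]]


def run_drill(steps: List[Step]) -> List[Tuple[str, float]]:
    """Run each checklist step in order and record its duration in seconds."""
    timings = []
    for name, action in steps:
        started = time.perf_counter()
        action()  # an exception here stops the drill at the failing step
        timings.append((name, time.perf_counter() - started))
    return timings


def total_seconds(timings: List[Tuple[str, float]]) -> float:
    """Add up the per-step durations to get the full restore time."""
    return sum(seconds for _, seconds in timings)
```

Calling `run_drill` with your own list of steps gives you a per-step timing you can write down after every drill, which is what turns "we have backups" into "we can restore in about this long."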
Let’s dive in a little on these deceptively simple requirements. First, there is no way to verify the quality of your backup process until you restore from scratch. I really want to emphasize this point, since most teams will rest easy once the backups have finished, but until both the backups and the restoration process are verified, you don’t have a reliable plan.
Furthermore, if your website goes down randomly, and you have a high viewership, the enormity of the stress can be hard to imagine. If you are working on a battle-hardened platform and it goes down, it’s likely because something is very wrong: something that can’t be fixed with a simple server restart or a look at your caching systems. It’s usually something new and hard to track down. When this has happened to me on servers serving content to huge audiences, it has been very stressful. The first thing a team will think is, “Am I going to get fired, and if not, will one of my friends get fired … how did this happen … whose fault is this?” There is a common sequence of events from the moment the site goes down:
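One rough way to check a restored copy is to compare file checksums between the original and the restored directory. This is only a sketch under my own assumptions: it covers files on disk (not databases), it reads each file fully into memory, and `checksum_tree` and `diff_trees` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def checksum_tree(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            # read_bytes loads the whole file; fine for a sketch, not for huge files
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_trees(source: Path, restored: Path) -> List[str]:
    """List the differences between the source tree and the restored tree."""
    src, dst = checksum_tree(source), checksum_tree(restored)
    problems = []
    for name in sorted(src.keys() | dst.keys()):
        if name not in dst:
            problems.append(f"missing after restore: {name}")
        elif name not in src:
            problems.append(f"unexpected file: {name}")
        elif src[name] != dst[name]:
            problems.append(f"content differs: {name}")
    return problems
```

An empty list from `diff_trees` means the restored files match the source byte for byte; anything else is a line item to chase down before you trust the backup.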
- Alerts are Triggered: Sysadmins or tech team staff should be sent a notification from an automated system (one that regularly monitors site availability) that the site is down. If you don’t have a good system for monitoring your uptime, try Pingdom or similar; notifications should go out within seconds. You can also do this with your own cron monitoring script and send out text messages via Twilio.
- Team Finds Out: Team members who received notifications may not see them for some amount of time. Maybe they’re on a date, or at a concert, or their phone is silenced at a movie. Make sure your need-to-know teammates have redundant habits for knowing when things go awry. This could be a very big deal if you are working on a high visibility site—since viewers will make up their own reasons for why your site went down. “Was it hacked, are they incompetent?”
- Problem is Assessed: Someone on the tech team sees the terrible news first. Now, how quickly can they get to a terminal or laptop to evaluate how bad the situation really is? Again, this lag is really hard to predict unless your team is prepared. I used to live blocks away from my office and would be on call overnight—and have spent time sprinting through the cold winter air at 3am to deal with this kind of thing in person (since we had strong security measures pertaining to outside system access). I hope your running times are good.
These first few steps could take as little as five minutes, or as long as an entire night. You don’t want to wake up to text messages from people outside of your company that the site is down.
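The home-grown monitoring option mentioned in the list above could look something like this. It is a minimal sketch meant to be run from cron every minute or so; `site_is_up` and `check_and_alert` are names I invented, and the `notify` argument is a stand-in for whatever actually sends the text message (a Twilio call, for example), so the sketch runs without any credentials.

```python
import urllib.error
import urllib.request
from typing import Callable


def site_is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, DNS failures, timeouts, and malformed URLs all count as down
        return False


def check_and_alert(
    url: str,
    notify: Callable[[str], None],
    is_up: Callable[[str], bool] = site_is_up,
) -> bool:
    """Check the site once and call notify with a message if it is down."""
    if is_up(url):
        return True
    notify(f"ALERT: {url} is not responding")
    return False
```

Passing `print` as `notify` is enough to try it out. To cover the "phone is silenced at a movie" problem, the same function can be called with more than one notifier, or with one that messages several people.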
Now that you have a member of the tech team on the case, how do you isolate the problem? Again, this is hard to pin down, since individual competencies vary enormously. Do they tail the log files (a term for only loading a handful of recent entries from the end of the server’s error logs) to try and catch any recent glitches, run through a checklist of items like restarting NGINX, or try hard-restarting the server? At what point do they decide that this is a potential worst-case scenario and that they should restore from their backups? What do they do next?
Well, let me help you here.
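For anyone who hasn’t tailed a log before, here is roughly what it amounts to, sketched in Python rather than with the usual command-line tool. The function name and the default of 20 lines are my own choices.

```python
from collections import deque
from typing import List


def tail(path: str, lines: int = 20) -> List[str]:
    """Return the last `lines` lines of a text file, oldest first."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the most recent lines as it reads through
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]
```

This reads the whole file once, which is fine for a quick look but slow on a multi-gigabyte log.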
Tracking down errors on a live server can potentially take days, depending on the error. Maybe you get lucky and just need to restart the server, or you received a traffic spike and need to upgrade your server’s memory and bandwidth allocation with the hosting company—easy peasy. In my experience, these situations happen periodically, and aren’t the situations I’m referring to here.
The best way I know of to handle this situation is mirrored server redundancy—which means that you always have multiple servers in different geographic locations running identical copies (or as close as possible) of your website. In this case, when the site goes down, your system should automatically redirect to an unaffected server, and if this can’t be done, manually updating the DNS to point to an IP address with a safe copy should be a failsafe. It’s fairly unlikely (as in, insanely unlikely) that someone would be able to take down mirrored copies around the globe, although somehow gaining control of your DNS server or domain itself would be just as problematic.
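The "redirect to an unaffected server" decision boils down to walking a list of mirrors and taking the first one that answers. Here is a hedged sketch of just that decision; it does not update DNS or a load balancer, the function names are hypothetical, and a bare TCP connection is a much weaker health check than you would want in production.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_port_open(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_healthy_mirror(
    mirrors: Iterable[str],
    is_healthy: Callable[[str], bool] = tcp_port_open,
) -> Optional[str]:
    """Return the first mirror that passes the health check, or None if all fail."""
    for mirror in mirrors:
        if is_healthy(mirror):
            return mirror
    return None
```

If `pick_healthy_mirror` returns `None`, you are past failover and into restore-from-backup territory.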
Since most startups aren’t doing anything this sophisticated, errors are bound to happen, if only because they finally get mentioned in TechCrunch (and exceed the bandwidth for their current plan). As they are probably using Amazon Web Services (AWS) or DigitalOcean, or yes, sigh … Heroku or something, they might falsely think that their hosting company will handle the scaling automatically. Many hosted solutions will auto-scale your virtual server to account for additional traffic, and kindly bill you accordingly (and there are attack techniques, like DDoS, that can be used simply to increase your traffic bill). This doesn’t account for the entirety of the problems that occur because of a poor understanding of the server architecture, memory limits, system compatibility, patches, Memcached configurations, etc. Instead of going on and on about various mix-and-match possibilities, let’s just say, instead of worrying about what may happen, work to define your process in terms of “acceptable times to recover from various worst-case scenarios.”
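To make "acceptable times to recover" concrete, you can write the targets down and compare them with what your drills actually measured. A small sketch, with made-up scenario names and minutes as the unit; a scenario you have never drilled is treated as over target.

```python
from typing import Dict, List


def scenarios_over_target(
    targets: Dict[str, float], measured: Dict[str, float]
) -> List[str]:
    """Return scenarios whose measured recovery time exceeds the acceptable time.

    Both dicts map a scenario name to minutes. Scenarios missing from
    `measured` have never been drilled, so they are reported as well.
    """
    return sorted(
        name
        for name, limit in targets.items()
        if measured.get(name, float("inf")) > limit
    )


# Illustrative numbers only, not taken from any real drill.
targets = {"server restart": 5, "restore from backup": 60, "DNS cutover": 30}
measured = {"server restart": 3, "restore from backup": 95}
print(scenarios_over_target(targets, measured))
```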
What are you willing to lose? For me, if my blog gets pwn3d, I lose face, since I write about security; my ego gets bruised and I go to bed sad. Some of my friends will wonder what happened, and after a while, I get up and get back on the horse. Since I have live copies of the entirety of my server, it’s more of an annoyance than a critical issue. The fun of it all is trying to figure out who “got me,” and how they did it. Unraveling the forensic puzzle begins, and I get to learn something new. My point is that if I didn’t have redundant copies of my work, and they actually deleted the only copy of my blog, I would be really bummed. I would have lost all of my recent writing, and it would really hurt. My worst-case scenario is losing face, and the time it takes me to deploy a fresh copy is in the neighborhood of a few minutes (unless DNS updates are required). Losing face is an annoyance, but since I could deploy a mirrored copy of my blog quite quickly, virtually no one would actually know. Without getting more into the cat and mouse of how we could escalate the situation, my goal is to emphasize the difference between losing face and losing thought products that aren’t backed up.
At large news organizations, the worst-case scenario is losing X number of stories (since the news team is constantly cranking out new content), Y number of user comments (since there could be hundreds), and Z number of visitors (since people get frustrated when sites are down—or in the case of Reddit going down, just refresh until those with limited patience drop off).
You can actually quantify the financial losses due to a site’s downtime by using your analytics platform to determine: a) the estimated number of visitors and page views for a comparable period, and b) the dollar amount of potential advertising revenue that was lost during an incident. The business could say, “we lost $20,000 during the two hours the site was down,” whereas members of the tech team could lose hundreds of thousands if they get canned (losing their high-paying job) and don’t get a good reference for their next job (which they would now be on the lookout for).
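The arithmetic is simple enough to sketch. I am assuming your analytics give you page views for a comparable period and that you know your advertising revenue per thousand page views; both figures below are invented so that they land on the $20,000 example.

```python
def estimated_lost_revenue(
    page_views_in_comparable_period: int, revenue_per_thousand_views: float
) -> float:
    """Estimate ad revenue lost during an outage.

    page_views_in_comparable_period: page views seen during a similar window
        (same weekday and hours) when the site was up.
    revenue_per_thousand_views: advertising revenue per 1,000 page views.
    """
    return page_views_in_comparable_period / 1000 * revenue_per_thousand_views


# Hypothetical: 2,000,000 page views in a comparable two hours at $10 per thousand.
print(estimated_lost_revenue(2_000_000, 10.0))  # 20000.0
```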
The worst part is that many of these failures have little to do with incompetence (in my experience) but have a lot to do with random occurrences that are hard to plan for—ranging from server hard drive failures all the way to rogue staff that have gone postal. If your process is solid, restoring under pressure will be a walk in the park—which is a good feeling when all eyes are on you.
Lastly, all procedures for backing up and restoring should be written as plainly as possible. You should be able to hand the directions to anyone on the tech team and have them complete the entire process without deep expertise in any of the individual technologies. If your main sysadmins are out of the country, or busy, this could end up saving the day. Keep it simple, stupid.
Nick is the Founder & CEO of MetaSensor, a venture-backed internet of things startup located in Silicon Valley, and a Behavioural Product Designer at Duke's Center for Advanced Hindsight (with Dan Ariely et al.).