In my new role at EMC, I am one of the first people to learn of major problems that my customers experience. In general, customers seem to call their sales team before technical support when a big problem happens. In the past week, I’ve been involved in recovery efforts with two different customers, both resulting from complete power outages in their production datacenters.
Both of these customers process millions of dollars through their global customer-facing websites. The smaller of the two does not have a disaster recovery site of any kind, while the larger customer does have a recovery site, but it is not designed for 100% operation and is hundreds of miles away.
What became clear through both of these incidents is that having a clear, well-known recovery plan is critical to the business. These experiences drove home the point that even if you don’t have a recovery site, aren’t using replication, and otherwise don’t have any way to recover the data offsite, you still need a plan that encompasses what you CAN do. More often than not, major outages are short-lived and you will be recovering in your primary datacenter anyway, so you need a predetermined plan to prevent major issues and shorten the time to recover.
Here are some things to think about when creating a recovery plan:
- Get the application owners together and build a list of all the applications running in your environment. Document the purpose of each application and map dependencies that each application has on other applications.
- Next, involve the server/systems admins and document the server names, database names, IP addresses, and DNS names for each application on the list.
- Finally, involve the infrastructure teams (storage, network, datacenter) and document the network dependencies (subnets, routers, VPN connections, load balancers, etc.). Document any SAN storage used by the servers/applications. Also document how each infrastructure component affects others (e.g., the SAN switches must be operational before servers can connect to storage arrays).
- Work with business leaders to prioritize the applications. The idea is to understand how much impact each application has on the business, both from a productivity perspective and in direct financial impact. There may be legal requirements or service level agreements with customers to consider as well.
- If possible, identify the maximum amount of time each application can be down after a catastrophic event (RTO – Recovery Time Objective) and how much data can be lost without significant impact to the business (RPO – Recovery Point Objective). These metrics are usually measured in minutes, hours, or days.
- Document the backup method for each server and application. How often are backups run? What is the retention period? How long does it take to complete backups? What is the expected time to restore the data? How long does it take to recall tapes from offsite storage?
- At this point you have a prioritized list of applications. Now build a step-by-step recovery plan that lists the exact order in which you must recover systems. The list should include server names as well as validation points to confirm that certain systems are working before moving to the next step. For example:
  - Step 1: bring up the network switches and routers
  - Step 2: bring up the DNS/DHCP servers
  - Step 3: bring up Active Directory servers
  - Step 4: bring up SAN fabric switches
  - Step 5: bring up SAN storage arrays, verify health of arrays with help from vendor
  - Step 6: …
I recommend that one of your first steps, before starting recovery, be to contact your key vendors (storage array vendors at least) and notify them of your outage so they can get support resources ready to troubleshoot any hardware issues you may run into during the recovery.
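If the dependency map from the earlier steps is kept in a structured form, the recovery order can be derived from it rather than maintained by hand. The following is a minimal sketch, assuming Python 3.9+ and its standard `graphlib` module. The `DEPENDS_ON` map and the `recovery_order` helper are hypothetical names, and the dependencies shown are illustrative assumptions, not a statement about any particular environment.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each component lists what must be running
# before it can be brought up. Replace with your own documented dependencies.
DEPENDS_ON = {
    "network switches and routers": [],
    "DNS/DHCP servers": ["network switches and routers"],
    "Active Directory servers": ["DNS/DHCP servers"],
    "SAN fabric switches": [],
    "SAN storage arrays": ["SAN fabric switches"],
    # The two entries below are made-up application tiers for illustration.
    "database servers": ["SAN storage arrays", "Active Directory servers"],
    "customer-facing website": ["database servers"],
}


def recovery_order(depends_on):
    """Return the components in an order where every dependency comes first."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # A cycle means the documented dependencies contradict each other.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


for step, component in enumerate(recovery_order(DEPENDS_ON), start=1):
    print(f"Step {step}: bring up {component}")
```

Several orderings can satisfy the same dependencies, so treat the output as a starting point and add your validation points and business priorities by hand.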
- Identify key players needed in a recovery, at least primary and secondary contacts for every application and vendor contacts for hardware/software, facilities, UPS/Generator support teams, etc.
- Establish a standard communication plan to include at least the following…
  - A method to notify employees of an outage and give instructions
  - A method to notify key players for recovery
  - A mechanism for key players to communicate with each other during the recovery
  - Personal (not corporate/business) contact information for all of the key players
The key thing to remember here is that you cannot rely on any communication tools that are part of your infrastructure. You must assume your PBX/VoIP system will be down, email will be down, corporate instant messenger will be down, SharePoint will be unavailable, etc.
- If you have a remote recovery site, with or without replication technology, and intend to use the remote site to recover production applications in the event of a large failure, be sure to document the triggers for moving to the recovery site. As an example, you may want to attempt recovery in the primary site, and then move to the recovery site if recovery at the primary site will take too long. Be sure to document that time limit and get executive sign-off. You should not hear “how long do we wait until we move to the DR site?” during an active recovery operation. That decision needs to be made during the planning exercise.
- Document the entire plan and store the digital copies in a readily accessible place (file shares, SharePoint site, etc.). Keep additional copies on USB sticks or CDs stored in a safe place. Keep even MORE copies in another location outside the primary datacenter facility (e.g., safe deposit box, remote office safe, etc.). Print copies as well and store the printed copies in similar safe places. Assume that a building may not be accessible due to fire or flood. I know one customer who issues fingerprint-secured USB sticks to every manager. Each manager must sync their USB stick to a server at least monthly or upper management is notified.
- Make sure that everyone is aware of the recovery plan, who has access to the plan, where the copies are stored, and what role each of the key players is expected to play during a recovery.
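The trigger for moving to the recovery site can be written down as a simple rule so nobody has to debate it during an outage. Here is a minimal sketch of one possible rule. The four-hour limit, the `MAX_PRIMARY_SITE_RECOVERY` constant, and the `should_move_to_dr_site` function are all hypothetical; the real limit is whatever your executives approved during planning.

```python
from datetime import timedelta

# Hypothetical limit agreed on during planning: the longest the business will
# wait for recovery at the primary site before moving to the recovery site.
MAX_PRIMARY_SITE_RECOVERY = timedelta(hours=4)


def should_move_to_dr_site(elapsed, estimated_remaining,
                           limit=MAX_PRIMARY_SITE_RECOVERY):
    """Return True when the projected primary-site recovery exceeds the limit.

    elapsed: time since the outage began.
    estimated_remaining: current best estimate of time left to recover on site.
    """
    return elapsed + estimated_remaining > limit


# One hour into the outage with an estimated two hours of work left: stay put.
print(should_move_to_dr_site(timedelta(hours=1), timedelta(hours=2)))
# Three hours in with two hours still to go: the documented trigger is met.
print(should_move_to_dr_site(timedelta(hours=3), timedelta(hours=2)))
```

The value of the rule is less in the arithmetic than in having the limit agreed and written down before it is needed.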
There is far more to think about but hopefully you can get a good start with what I’ve listed above. If you have a recovery plan already, you should review it regularly and think about anything that needs to be added or modified in the plan.
If you are having trouble getting executive approval for a remote recovery site and replication technology, going through this exercise and defining application priority with RPO/RTO for each could give you the ammo you need. Traditional backup architectures aren’t designed for RPOs under 24 hours, while storage array based replication can get RPOs down into the minutes. Restoring from tape also takes far longer than restoring from replicated data.
Last but not least, keep the plan updated as your environment changes. Add new application and server details to the plan as part of the implementation process for new applications, or as part of change control procedures for significant changes to the infrastructure.
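One way to build that case is to compare each application's RPO and RTO against the backup figures you documented and list where the current backups fall short. The sketch below is a rough illustration only: the `AppProtection` class, the `find_gaps` function, and every number in it are made up, and it simplifies worst-case data loss to one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class AppProtection:
    """Targets and measured backup figures for one application (hypothetical)."""
    name: str
    rpo: timedelta              # maximum tolerable data loss
    rto: timedelta              # maximum tolerable downtime
    backup_interval: timedelta  # time between backups
    restore_time: timedelta     # expected restore time, including tape recall


def find_gaps(apps):
    """Return (app name, objective, shortfall) for every objective not met."""
    gaps = []
    for app in apps:
        # Simplification: worst-case data loss equals one full backup interval.
        if app.backup_interval > app.rpo:
            gaps.append((app.name, "RPO", app.backup_interval - app.rpo))
        if app.restore_time > app.rto:
            gaps.append((app.name, "RTO", app.restore_time - app.rto))
    return gaps


# Illustrative figures only, not taken from any real environment.
apps = [
    AppProtection("web storefront",
                  rpo=timedelta(minutes=15), rto=timedelta(hours=2),
                  backup_interval=timedelta(hours=24),
                  restore_time=timedelta(hours=10)),
    AppProtection("intranet wiki",
                  rpo=timedelta(hours=24), rto=timedelta(days=2),
                  backup_interval=timedelta(hours=24),
                  restore_time=timedelta(hours=6)),
]

for name, objective, shortfall in find_gaps(apps):
    print(f"{name}: misses {objective} by {shortfall}")
```

A short list of high-priority applications that miss their objectives by hours is usually easier for executives to act on than a general argument for replication.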