Well, not exactly. What you really need is a restore solution!
I was discussing this with a colleague recently as we compared the difficulties several customers are having with backups in general. He related a conversation with one of his customers in which he told them, “stop thinking about how to design a backup solution, and start thinking about how to design a restore solution!”
Most of our customers are in the same boat: they work really hard to make sure that their data is backed up within some window of time, and offsite as soon as possible, to ensure protection in the event of a catastrophic failure. What I noticed in my previous positions in IT, and even more so now as a technical consultant with EMC, is that most people don’t really think about how that data is going to get restored when it is needed. There are a few reasons for this:
- Backing up data is the prerequisite for a restore; IT professionals need to get backups done regardless of whether they ever need to restore the data. It’s difficult to plan for theoretical needs, and restore is still viewed, incorrectly, as theoretical.
- Backup throughput and duration are easily measured on a daily basis; restores occur far less often and are not normally reported on.
- Traditional backup has been done largely the same way for a long time, and most customers follow the same model: nightly backups (weekly full, daily incremental) to disk and/or tape, with tapes shipped offsite to Iron Mountain or a similar service.
I think storage vendors, EMC and NetApp in particular, are very good at pointing out the distinction between a backup solution and a restore solution; backup vendors are not quite as good at this. So what is the difference?
When designing a backup solution the following factors are commonly considered:
- Size of Protected Data – How much data do I have to protect with backup (usually GB or TB)
- Backup Window – How much time do I have each night to complete the backups (in hours)
- Backup Throughput – How fast can I move the data from its normal location to the backup target
- Applications – What special applications do I have to integrate with (Exchange, Oracle, VMware)
- Retention Policy – How long do I have to hang on to the backups for policy or legal purposes
- Offsite storage – How do I get the data stored at some other location in case of fire or other disaster
If you look at it from a restore perspective, you might think about the following:
- How long can I afford to be down after a failure? Recovery Time Objective (RTO): This will determine the required restore speed. If all backups are stored offsite, the time to recall a tape or copy data across the WAN affects this as well.
- How much data can I afford to lose if I have to restore? Recovery Point Objective (RPO): This will determine how often the backup must occur, and in many cases this is less than 24 hours.
- Where do I need to restore the application? This will help in determining where to send the data offsite.
Answer these questions first and you may find that a traditional backup solution is not going to fulfill your requirements. You may need to look at other technologies, like snapshots, clones, replication, CDP, etc. If a backup takes 8 hours, the restore of that data will most likely take at least 8 hours, if not closer to 16. If you are talking about a highly transactional database, hosting customer-facing web sites and processing millions of dollars per hour, 8 hours of downtime for a restore is going to cost you tens of millions of dollars in lost revenue.
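To make that concrete, here is a quick back-of-the-envelope sketch in Python. The figures (a 20TB database, a one-hour RTO, $2M per hour of revenue) are hypothetical placeholders; the point is the arithmetic, which works backwards from the RTO to the restore throughput you would actually need:

```python
# Back-of-the-envelope restore math: given a data size, an RTO, and the
# revenue the application earns per hour, what sustained restore throughput
# do you need, and what does a slow restore cost? All figures hypothetical.

def required_restore_throughput_gbps(data_tb: float, rto_hours: float) -> float:
    """Sustained throughput (Gbps) needed to restore data_tb within rto_hours."""
    bits = data_tb * 8 * 1000**4            # TB -> bits (decimal units)
    return bits / (rto_hours * 3600) / 1e9  # bits/sec -> Gbps

def downtime_cost(restore_hours: float, revenue_per_hour: float) -> float:
    """Lost revenue while the application is down waiting on a restore."""
    return restore_hours * revenue_per_hour

if __name__ == "__main__":
    data_tb = 20.0                 # size of the database to restore
    rto_hours = 1.0                # business wants to be back up within an hour
    revenue_per_hour = 2_000_000   # hypothetical transactional revenue

    print(f"Need ~{required_restore_throughput_gbps(data_tb, rto_hours):.1f} Gbps "
          f"sustained to restore {data_tb} TB in {rto_hours} h")
    # Versus a tape/dump restore that takes 8 hours:
    print(f"8-hour restore costs ~${downtime_cost(8, revenue_per_hour):,.0f} in lost revenue")
```

Run the numbers and a one-hour RTO on 20TB demands roughly 44 Gbps of sustained restore throughput, which no tape recall or WAN copy is going to deliver. That is the gap a restore-first design has to close.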
Two of my customers, for example, have database instances hosted on EMC storage that are in the 20TB range. Each has architected a backup solution that can get that 20TB database backed up within the backup window. The problem is, once that backup completes, they still have to offsite it, replicating it to their DR site across a relatively small WAN link. Both use compressed database dumps for backup because, from the DBA’s perspective, dumps are the easiest type of backup to restore from, and the compression helps get 20TB of data pushed across 1GbE connections to the backup server. One of the customers is already backing up all of their data to DataDomain deduplication appliances; the other is planning to deploy DataDomain. The problem in both cases is that if you pre-compress the backup data, you break deduplication, and you get no benefit from the DataDomain appliance over traditional disk. Turning off compression in the dump isn’t an option because the backup would then take longer than the backup window allows. The answer here is to step back, think about the problem you are actually trying to solve (restoring data as quickly as possible in the event of a failure), and design for that problem.
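To see why pre-compression defeats deduplication, here is a toy Python simulation. It is not how DataDomain actually works (real appliances use variable-size chunking and their own algorithms); the fixed 8KB chunks, 4MB dataset, and 1KB edit are all arbitrary choices for illustration. The effect it demonstrates, though, is real: compress two nearly identical dumps and the duplicate chunks disappear.

```python
# Toy demo: chunk-level dedup finds almost everything in two nearly
# identical raw dumps, and almost nothing once each dump is compressed,
# because compression turns a small edit into stream-wide differences.

import hashlib
import random
import zlib

CHUNK = 8 * 1024  # 8 KB fixed-size chunks (real appliances use variable-size)

def chunk_hashes(data: bytes) -> set:
    """Fingerprint every fixed-size chunk of the stream."""
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

def dedup_ratio(night1: bytes, night2: bytes) -> float:
    """Fraction of night2's chunks that are duplicates of night1's."""
    h1, h2 = chunk_hashes(night1), chunk_hashes(night2)
    return len(h2 & h1) / len(h2)

random.seed(0)
# A compressible 4 MB "database dump" (small alphabet compresses well).
base = bytes(random.choices(b"ACGT", k=4 * 1024 * 1024))
# Night 2: identical except for a 1 KB change near the front.
night2 = base[:1024] + b"\x00" * 1024 + base[2048:]

print(f"raw dumps:            {dedup_ratio(base, night2):6.1%} duplicate chunks")
print(f"pre-compressed dumps: {dedup_ratio(zlib.compress(base), zlib.compress(night2)):6.1%}")
```

On the raw dumps only the one changed chunk is new, so nearly 100% of night two deduplicates away. On the compressed dumps the edit shifts and rewrites the compressed stream from that point on, so almost nothing matches and the appliance stores night two nearly in full.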
How might these customers leverage what they already have, while designing a restore solution to meet their needs?
Since they are already using EMC storage, the first step would be to start taking snapshots and/or clones of the database. These snapshots can be used for multiple purposes:
- In the event of database corruption, or another host/filesystem/application level problem, the production volume can be reverted to a snapshot in a matter of minutes regardless of the size of the database (better RTO); the copy-on-write sketch after this list shows why revert time is independent of volume size. Snapshots can be taken many times a day to reduce the amount of data loss incurred in the event of a restore (better RPO).
- A snapshot copy of the database can be mounted to a backup server and backed up directly to tape or backup disk. This eliminates the need to perform database dumps at all, as well as any network bottleneck between the database server and backup server. Since there is no dump process, and no requirement to pre-compress the data, deduplication (via DataDomain) can be employed at full efficiency. Using a small 10Gbps private network between the backup media servers and the DataDomain appliances, in conjunction with DD Boost, throughput can be 2.5X faster than with CIFS, NFS, or VTL to the same DataDomain appliance. And with deduplication fully leveraged, retention can be very long, since each day’s backup adds only a small amount of new data to the DataDomain.
- Now that we’ve improved local restore RTO/RPO, eliminated the backup window entirely for the database server, and decreased the amount of disk required for backup retention, we can replicate the backup to another DataDomain appliance at the DR site. Since we are taking full advantage of deduplication, the replication bandwidth required is greatly reduced and we can offsite the backup data in a much shorter period of time.
- Next, we give the DBAs back the ability to restore databases easily, and at will, by leveraging EMC Replication Manager. RM manages the snapshot schedules, mounting of snaps to the backup server, and initiation of backup jobs from the snapshot, all in a single GUI that storage admins and DBAs can access simultaneously.
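As promised above, here is a minimal, purely conceptual sketch of a copy-on-write snapshot, showing why a revert touches only the blocks changed since the snapshot rather than the whole volume. The class, block size, and volume size are all invented for illustration; this is not how any EMC array is implemented internally.

```python
# Conceptual copy-on-write snapshot: the snapshot holds only the
# *original* contents of blocks overwritten after it was taken, so a
# revert copies back a handful of blocks (or, on a real array, swaps
# pointers) no matter how large the volume is.

class Volume:
    def __init__(self, nblocks: int):
        self.blocks = [b"\x00" * 512] * nblocks  # the "production" data
        self.snapshot = None                     # block index -> original data

    def take_snapshot(self):
        self.snapshot = {}                       # empty: nothing copied yet

    def write(self, idx: int, data: bytes):
        # Copy-on-write: preserve the old block the first time it changes.
        if self.snapshot is not None and idx not in self.snapshot:
            self.snapshot[idx] = self.blocks[idx]
        self.blocks[idx] = data

    def revert(self):
        # Cost is proportional to blocks changed since the snapshot,
        # not to the size of the volume.
        for idx, old in self.snapshot.items():
            self.blocks[idx] = old
        self.snapshot = {}

vol = Volume(nblocks=1_000_000)                 # a "huge" volume
vol.take_snapshot()
vol.write(42, b"corrupt!".ljust(512, b"\x00"))  # corruption after the snap
vol.revert()                                    # touches 1 block, not 1,000,000
assert vol.blocks[42] == b"\x00" * 512
```

This is why the RTO of a snapshot revert stays in minutes whether the database is 200GB or 20TB: the work is bounded by the change rate, not the capacity.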
So we leveraged the backup application they already own, the DataDomain appliances they already own, and the storage arrays they already own, built a small high-bandwidth backup network, and layered on some additional functionality to drastically improve their ability to restore critical data. The very next time they have a data integrity problem that requires a restore, these customers will save literally millions of dollars thanks to their ability to restore in minutes rather than hours.
If RPOs of a few hours are not acceptable, then a Continuous Data Protection (CDP) solution can be added to this environment. EMC RecoverPoint CDP journals all database activity so it can be used to restore to any point in time, bringing data loss (RPO) to zero or near-zero, something no amount of snapshots can provide, while keeping restore time (RTO) within minutes, like snapshots. Further, the journaled copy of the database can be stored on a different storage array, providing protection for the entire hardware/software stack. RecoverPoint CDP can be combined with Continuous Remote Replication (CRR) to replicate the journaled data to the DR site, providing near-zero RPO and extremely low RTO in a DR/BC scenario. Backups could even be transitioned to the DR site, leveraging the RecoverPoint CRR copies to reduce or eliminate the need to replicate backup data. EMC Replication Manager manages RecoverPoint jobs in the same easy-to-use GUI as snapshot and clone jobs.
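For the CDP concept itself, here is a conceptual toy in Python: journal every write with a timestamp, and rebuild the state as of any point in time by replaying the journal. RecoverPoint is vastly more sophisticated (block-level, consistency groups, dedicated journal volumes), and the class and scenario below are entirely made up, so treat this only as an illustration of the any-point-in-time idea.

```python
# Toy journal-based CDP: every write is timestamped, and any point in
# time can be reconstructed by replaying the journal up to that moment.

import time

class CDPJournal:
    def __init__(self, baseline: dict):
        self.baseline = dict(baseline)   # state when journaling began
        self.journal = []                # time-ordered (timestamp, key, value)

    def write(self, key, value):
        self.journal.append((time.perf_counter(), key, value))

    def restore_to(self, point_in_time: float) -> dict:
        """Rebuild state as of point_in_time by replaying journaled writes."""
        state = dict(self.baseline)
        for ts, key, value in self.journal:
            if ts > point_in_time:
                break                    # journal is time-ordered
            state[key] = value
        return state

db = CDPJournal(baseline={"balance": 100})
db.write("balance", 250)
just_before_corruption = time.perf_counter()
db.write("balance", -999)                # a bad transaction lands
# RPO ~ 0: roll back to the instant before the corruption hit.
print(db.restore_to(just_before_corruption))   # {'balance': 250}
```

Snapshots can only roll back to discrete points; a journal lets you pick the instant just before the corruption, which is what drives RPO to near-zero.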
There are a whole host of options available from EMC (and other storage vendors) to protect AND restore data in ways that traditional backup applications cannot match. This does not mean that backup software is not also needed, as it usually ends up being a combined solution.
The key to architecting a restore solution is to start thinking about what would happen if you had to restore data, how that impacts the business and the bottom line, and then architect a solution that addresses the business’ need to run uninterrupted, rather than a solution that is focused on getting backups done in some arbitrary daily/nightly window.
Nathan Schmidt
October 28, 2010 at 9:22 am
I work for a world-leading hosting provider. In my professional opinion, you’re on the right track, but I suggest one more mind tweak. One of the concepts that I think many in this situation fail to realize is that when you’re talking about RTO and RPO, you’re talking about restoring operations. When you talk about restoring operations, you’re talking about disaster recovery. Most of the world’s data enterprises, such as the ones you mention, bank on the fact that so many traditionalists try to use a data backup to restore operations. Many of these, too, would like you to think that backup is disaster recovery. The fact is that your business, and the systems that run your business, are more than the sum of their data. Your systems are based on configurations that use data as a function. The best analogy is that your business depends on money to function, but if all you do is save money in the bank, that doesn’t necessarily help your business keep running if something happens to the systems that run your business.
It makes sense for the companies you mentioned to continue providing these applications and perpetuating the notion that backups are still critically important: backups consume large amounts of storage space! You’re having to back up multiple copies of the same thing. Even with deduplication technologies, what you might gain in reduced storage space is made up for in the licensing of the deduplication engines. Nowadays, with virtualization and cloud for example, it is my opinion that backups should be relegated to backing up critical data, not the OS and other redundant bytes, and that VMware vMotion or some other type of VM resource clustering should be used instead. This way, the most important part of your business, the configuration of the servers making your business run smoothly, is as portable as possible. Most cloud server solutions also allow for easy snapshots so you can bring your configuration back just as it was. As a collective of thinkers around backup and data protection, we have to stop synonymously connecting backup and disaster recovery. They are simply two different things.
storagesavvy
October 28, 2010 at 10:39 am
I completely agree with you. Customers need to look at disaster recovery/business continuity more holistically. Backup/Restore is just one of many tools available to help with DR/BC.