From Whence You Came
Great news: your disaster recovery was a success. You have recovered all your systems, and your applications are up and running at your DR location after the failover. Cost isn't too grave a concern right now, mainly because your insurance company will be footing the bill, but your allotted time is an issue! You've been running your systems at the DR facility and your contracted time is up, or your failover office has become too expensive and restrictive to keep working from. What are the next steps to return to normal operations?
In part one I'd like to touch on the finer points, such as the overall strategy and what considerations should be taken into account, and in part two I'll take more of a technical deep dive into the underlying technologies. This entry covers the strategy you should have in place beforehand.
To clarify: a failover operation is the process of switching production to a backup facility, whether it is your own or a rented one. A failback operation is the process of returning production to its original location after a disaster or a scheduled maintenance period. Sometimes the hardest part of disaster recovery is getting back to normal business operations after you've already recovered. The reason is that you now have to account for the amount of data change happening at the failover location, your available bandwidth, and how your data is being copied, mirrored, or replicated while your original site is coming back online. There are plenty of approaches for this, and I tackled a lot of these technologies in my previous article here.
First, determine how long you can run in failover mode effectively before a failback.
- Do you have enough storage and tapes at the site to continue normal operations for a determined period of time? (The sketch after this list shows one way to put a rough number on that runway.)
- Can you still back up and replicate from your failover site now that it has essentially become production?
- Do you have adequate staffing, and how long will they reasonably be able to stay after going through the DR itself?
- Is the hardware at the DR location capable of running your apps at their peak times?
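To put a number on that first question, a back-of-the-envelope calculation is usually enough. Here is a minimal sketch; every figure in it (free capacity, daily data growth, tape consumption) is a made-up placeholder, so substitute whatever your capacity monitoring actually reports.

```python
# Rough runway estimate for a failover site: how many days of normal
# operations can be sustained before storage or tape capacity runs out?
# All numbers below are hypothetical placeholders.

free_storage_tb = 12.0        # unused disk capacity at the DR site
daily_data_growth_tb = 0.4    # new/changed data landing on disk per day

free_tapes = 60               # unused tapes left in the library
tapes_consumed_per_day = 1.5  # tapes written by the nightly backup

storage_runway_days = free_storage_tb / daily_data_growth_tb
tape_runway_days = free_tapes / tapes_consumed_per_day

# The shorter of the two is the real limit.
runway_days = min(storage_runway_days, tape_runway_days)

print(f"Storage runway: {storage_runway_days:.0f} days")
print(f"Tape runway:    {tape_runway_days:.0f} days")
print(f"Plan the failback well inside {runway_days:.0f} days")
```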
Second, have a contingency plan for different types of disaster scenarios.
- Will you have to order new hardware, and how long will it take for your servers to be shipped? Do you have an SLA in place for that?
- Will you fix any structural damage at the location, or will operations be moved somewhere else permanently?
- If it's just a single point of failure and nothing major, make sure you have a response team available that is delegated to handle single-node issues.
Third, examine what was failed over, what is going to be failed back, and in what order you should proceed.
- It is a good idea to bring back your lower-tiered applications first; why chance a service interruption on critical business units that are running fine?
- If you plan on failing back over the WAN, make sure you do it during off-peak times so that you don't interrupt normal business functions.
- If you are going to seed the data at your DR site and then bring it back to your recovered site, don't forget that the I/O on the disks will shoot through the roof, so try to do that during off-peak times as well.
- Do you have a plan in place to understand how much data is changing while you are running at the recovery site? This is what needs to be gauged when you fail back, because you have to add that catch-up time to the time it took for your DR. Think of adding the change rate to your RTO, or should I say RTA (recovery time actual), since you now have a baseline. (The sketch after this list works through the arithmetic.)
- Determine whether you can have data being written to your DR site and your restored production environment simultaneously, so that you can gracefully take the DR site offline and continue operations.
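A quick way to sanity-check that catch-up time is to work it out from the change rate and the WAN link you will be failing back over. This is a minimal sketch; every input (change rate, link speed, time spent at the DR site) is a hypothetical placeholder rather than a measurement from any real environment.

```python
# Estimate the failback sync time and the "recovery time actual" (RTA)
# once the change rate at the DR site is taken into account.
# All inputs are hypothetical placeholders; use your own measurements.

baseline_rto_hours = 8.0   # how long the original failover took
days_at_dr_site = 14       # how long you have been running failed over
daily_change_gb = 150.0    # data changed per day at the DR site
wan_mbps = 200.0           # usable WAN bandwidth for the failback
wan_efficiency = 0.7       # protocol overhead, contention, etc.

# Total changed data that has to be shipped back to the recovered site.
changed_gb = days_at_dr_site * daily_change_gb

# Effective throughput in GB per hour over the WAN.
gb_per_hour = wan_mbps * wan_efficiency / 8 * 3600 / 1000

catchup_hours = changed_gb / gb_per_hour
rta_hours = baseline_rto_hours + catchup_hours

print(f"Changed data to move back: {changed_gb:,.0f} GB")
print(f"Catch-up transfer time:    {catchup_hours:.1f} hours")
print(f"RTA (baseline + catch-up): {rta_hours:.1f} hours")
```

If the catch-up number lands outside your off-peak window, that is your cue to schedule several incremental syncs rather than one big cutover.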
Fourth, is it part of your plan to leverage existing equipment from, perhaps, a third site?
- It's reasonable to assume that three sites won't all go down at once, or maybe not, depending on your environment and geographic location. You can leverage existing equipment in the facility by breaking a disk mirror and running the device standalone in order to use the SAN for other operations. Make sure you understand your backup design thoroughly before attempting this, though!
- Does your equipment have change recording enabled so it can resume operations after you use the device for other purposes? Most big vendors have this ability: it tracks all changes in a bitmap so that when the mirror is unsuspended only the changed blocks need to be brought back in line, and you can go back to normal operations. (The sketch below illustrates the idea.)
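The vendor implementations live inside the array, but the concept is simple enough to sketch. The class below is a toy illustration of a dirty-block bitmap and assumes nothing about any particular vendor's API.

```python
# Toy illustration of change recording with a dirty-block bitmap.
# While a mirror is split, writes are tracked per block; on resync,
# only the marked blocks are copied and the bitmap is cleared.
# Conceptual sketch only, not any array's real implementation.

class DirtyBitmap:
    def __init__(self, total_blocks: int):
        self.dirty = [False] * total_blocks

    def record_write(self, block: int) -> None:
        """Called for every write that lands while the mirror is split."""
        self.dirty[block] = True

    def resync(self) -> list[int]:
        """Return the blocks that must be copied back, then clear the map."""
        changed = [i for i, is_dirty in enumerate(self.dirty) if is_dirty]
        self.dirty = [False] * len(self.dirty)
        return changed


bitmap = DirtyBitmap(total_blocks=1_000_000)
for block in (42, 42, 4096, 777_000):   # writes while running standalone
    bitmap.record_write(block)

print(f"Blocks to resynchronize: {len(bitmap.resync())} of 1,000,000")
```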
Fifth, have the underlying file systems been considered?
- Often overlooked are the file systems tied to the servers at your location. Think of an application that needs Windows NTFS, Linux ext4, AIX JFS, and so on to come online; can your storage meet those demands?
(On a personal note, I can't say I like anything on the market right now better than ZFS as an underlying system. In fact, every file system listed there can run on top of it. I am praying that Oracle does something useful with it.)
- If you have, say, ZFS or VMFS, does the failover hardware coincide with the correct block size? For instance, you may have hard drives that don't meet the requirement of running 4K sectors for ZFS, which can present a problem when coming back. (With ZFS, though, I could take the hard drives right out of, say, a FreeBSD server, stick them in a Solaris server, and it would continue running seamlessly.) A quick check along the lines of the sketch after this list can catch a sector-size mismatch before you commit to the hardware.
- Just remember that the latest and greatest hardware isn't always the best option, even if you have a check in your hand to buy new equipment. Make sure your stuff will run on it first, that's all.
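One way to catch that kind of mismatch early is to compare what the candidate drives report against what the existing pool was laid out for. The sketch below assumes you can supply the logical and physical sector sizes your OS tooling reports for each drive; the 4K expectation is just the example from the bullet above, not a hard rule for every pool.

```python
# Quick compatibility check: do the candidate failback drives report
# the sector size the existing pool/file system was built around?
# Drive geometries here are made up; feed in what your OS actually reports.

EXPECTED_SECTOR_BYTES = 4096   # what the pool was laid out for (example only)

candidate_drives = {
    "disk0": {"logical": 512,  "physical": 4096},  # 512e drive
    "disk1": {"logical": 4096, "physical": 4096},  # native 4K drive
    "disk2": {"logical": 512,  "physical": 512},   # legacy 512n drive
}

for name, geometry in candidate_drives.items():
    if geometry["physical"] == EXPECTED_SECTOR_BYTES:
        status = "matches the pool's expected sector size"
    else:
        status = "mismatch -- verify before failing back onto this drive"
    print(f"{name}: logical={geometry['logical']}B "
          f"physical={geometry['physical']}B -> {status}")
```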
Lastly, what do you do with the equipment once it's failed back? This also depends on a few scenarios, but I will go with the two most popular ones.
The first is a warm failback from an already owned secondary site that has no pressing time limitations. This is really the ideal case because you can continue running the systems while simultaneously writing data to both locations. That gives you the time and level of comfort to decide when to stop running in DR mode and go back to normal processes. Since the servers and equipment belong to you, a more carefully executed plan can be put into place before putting the servers back into standby mode.
Security and disk access are only as much of a concern as the facility they are located in. You have no real issues or concerns about leaving data behind; in fact, it shortens your re-sync time once operations resume!
The second scenario is if you are in a rented or borrowed space. You will have to remove all data from their systems and get off the premises. With this in mind, the DR plan must be sound, as must continued operations back at your own facility.
Removing your data from the drives needs to be done, and done securely. Please consider this when making a time-based SLA, because you need time to delete your sensitive data from the drives. Someone could potentially run data recovery software on your hard drive and recover the files you merely deleted. I personally like to use software that does a secure erase, meaning it writes 1's and 0's across the disks so that the data is permanently overwritten. There is a cool tool I have named MediaTools Wipe by Prosoft Engineering that gives you the ease and flexibility to do this on up to 18 drives simultaneously.
MediaTools Wipe works with any IDE, SATA, USB, eSATA, or FireWire connected disk or flash drive that is accessible by the operating system, including external hot-swappable multi-drive enclosures; whatever means you used to put data on a drive can be used to take it off, and that is key. I took a few screenshots below after booting the server from the CD.
The first screen is the license agreement
The second is where you put your license in
Next, pick any drive that you DON'T want deleted.
Then you get prompted to save your settings. If I had to guess, I'd say the underlying software is Ubuntu, based on the graphics.
Then you get a server-chassis-style graphic with the available hard drives to erase, and it shows some of the raw data sitting on each disk.
After you pick your drive press start to begin writing 1’s and 0’s to it
Do the standard confirmation that you are going to blow away your drives
When you're finished, check the bottom of the screen for success, then hit Close.
After that's done, I also suggest entering the BIOS on your x86 servers and breaking the RAID to doubly destroy the data. You can never be too safe.
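If a stray drive or two never makes it into the wipe appliance, the same overwrite idea can be scripted by hand. The sketch below is a minimal single-pass zero-fill, not the tool described above; the device path is a placeholder, and pointing it at the wrong disk will destroy that disk, so treat it with care (a dedicated tool or the drive's built-in secure-erase command is still preferable where available).

```python
# Minimal single-pass overwrite of a block device with zeros.
# DESTRUCTIVE: double-check the device path before running.
# DIY sketch only -- not MediaTools Wipe or any vendor tool.

import os

DEVICE = "/dev/sdX"          # placeholder: set to the drive you intend to wipe
CHUNK = 4 * 1024 * 1024      # write 4 MiB at a time

zeros = bytes(CHUNK)
written = 0

with open(DEVICE, "wb", buffering=0) as disk:
    try:
        while True:
            disk.write(zeros)
            written += CHUNK
    except OSError:
        # Raised once we run off the end of the device.
        pass
    disk.flush()
    os.fsync(disk.fileno())

print(f"Overwrote roughly {written / 1024**3:.1f} GiB with zeros on {DEVICE}")
```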
See you soon for part 2