“The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency.”
Bill Gates
In this article I’d like to stay away from the usual buzzwords and actually touch on some technologies and how things can be automated. Every vendor seems to say the same thing about automated recovery: “continuous data protection, replication, de-duplication and then virtual machines.” All of that is great, but most never actually show you what they are talking about beyond the buzzwords, so here we go.
Let’s break down some of those buzzwords and put some practical knowledge around them so that we can start to work with real-life examples.
- Continuous data protection (CDP) – An appliance or software that gives you continuous, near-real-time backups, triggered at set time intervals or data-size increments. Huge amounts of data are sent over the network, so bandwidth is a big issue, and assigning bandwidth priorities is another. This is a very efficient way to keep your RPO in check, but also a very expensive way to back up.
- Replication – Keeping the same data on multiple storage devices, with synchronization times you control. You can replicate at the file level, which syncs files and folders the traditional way, or use block-level replication, which syncs the raw 0s and 1s from the device and doesn’t care about formats or what kind of data it is. Block level is very efficient when replicating a storage device that holds multiple file systems (NTFS, UFS, etc.).
- De-duplication – Really just a compression technique that makes backups smaller by cross-checking data patterns against a target device and only sending the ones it hasn’t seen before. In a nutshell, if nothing changed, nothing gets sent over (see the sketch after this list).
- Virtual Machines – A software implementation of a real machine that can execute programs and use the existing system architecture. The advantage VMs hold in a DR scenario is that when you recover one host it may have 30-50 machines running inside of it. For some people it can even be their whole DR in a single box; simplicity at its finest.
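To make the de-duplication idea concrete, here is a minimal sketch of block-level dedup using content hashing. Everything in it (the fixed 4 KB chunk size, the function and file names) is my own illustration, not any particular vendor’s implementation:

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size blocks; real products often use variable-size chunking

def dedup_blocks(path, seen_hashes):
    """Yield only the blocks whose content the target has not seen before."""
    with open(path, 'rb') as f:
        while True:
            block = f.read(CHUNK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if digest not in seen_hashes:   # target already has this block? skip it
                seen_hashes.add(digest)
                yield digest, block

# usage: on a second pass over an unchanged file, nothing is yielded
seen = set()
new_blocks = list(dedup_blocks('/var/backups/data.img', seen))
print(f"{len(new_blocks)} unique blocks to send")
```

Run it twice against the same unchanged file and the second pass sends nothing, which is exactly the “if nothing changed, nothing gets sent” behavior described above.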
With those now understood, we can start to drill down from there and clearly note that we need a few things: Internet, storage devices, a place to recover, and staff to execute and validate the technology. One thing to remember is that regardless of the level of automation, somebody or some team will always need to validate it and be nearby with a crash kit. Whatever and wherever you fail over to, people will need to continue working, and you will need some measure of quality assurance in place.
First, without an ISP we can’t do anything. There need to be routers, load balancers, multi-layer switching, etc. for data to travel. For traffic to know where to travel, we need DNS, and an MX record for mail. Working with your ISP, you should be able to set multiple DNS records; it’s just a matter of how long a non-response you request before they push your traffic to the next DNS address. This is something to review with your service provider and your risk team, because it needs to be decided who has the authority to make the change at your service provider, and also whether you want it automated.
See the infographic provided by Total Uptime.
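As a rough illustration of that failover decision, here is a sketch of a health check that flags when traffic should be pushed to the secondary address. The endpoint, thresholds, and the final step are all placeholders; the actual change still happens at your service provider, by whoever has the authority:

```python
import socket
import time

PRIMARY = ('www.example.com', 443)   # hypothetical production endpoint
FAILURES_BEFORE_FAILOVER = 3         # how long a "non-response" you tolerate
CHECK_INTERVAL = 30                  # seconds between probes

def is_up(host, port, timeout=5):
    """Simple reachability probe: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

failures = 0
while failures < FAILURES_BEFORE_FAILOVER:
    failures = 0 if is_up(*PRIMARY) else failures + 1
    time.sleep(CHECK_INTERVAL)

# At this point you would trigger (or request) the DNS change at your provider,
# or simply page the person who has the authority to make it.
print("Primary site unreachable -- initiate DNS failover")
```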
Some of your more mature applications, SAP for example, need to have traffic routed across the enterprise, whether by a router, load balancer, or multi-layer switching, so that the large amounts of data and the departments they serve can be routed correctly, especially across different VLANs and networks. It is essential to have standby routers and switches in place with the same rule-sets you have in production, or data won’t be able to flow properly after the failover takes place.
Now we have a clear path for the data to fail over, and secondary addresses automated to find the next failover site if one goes down. With our communication flow staged, we need to put in the hardware. This is what we like to call the Tier 0 phase, and it is the essential hardware to stage the four examples above: SAN/NAS, backup server, virtualization host in a standby state, replication software on the live system. Think of this area as a giant target, a bullseye just sitting on the other end of a data pipe for your communications.
The first step is a process called seeding the SAN/NAS. You do your initial full data sync from device to device locally, since it is far too much data to push over the WAN. To calculate the bandwidth your replication/CDP requires, take your average data change rate within an RPO period and divide it by your link speed.
- Identify the average data change rate within your RPO by calculating the average change rate over a longer period and dividing it by your RPO.
- Calculate how much traffic this data change rate generates in each RPO period.
- Measure the traffic against your link speed.
For example, a data change rate of 100GB requires approximately 200 hours to replicate on a T1 network, 30 hours on a 10Mbps network, 3 hours on a 100Mbps network, and so on.
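Here is a quick back-of-the-envelope calculator for those numbers. The 75% efficiency factor is my own assumption to account for protocol overhead and competing traffic (you never get the full line rate); tune it to what you actually measure:

```python
def replication_hours(change_gb, link_mbps, efficiency=0.75):
    """Hours to push change_gb of changed data over a link of link_mbps.

    efficiency is an assumed fudge factor for protocol overhead and
    competing traffic -- adjust it to your own measurements.
    """
    megabits = change_gb * 8 * 1000              # GB -> megabits (decimal units)
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600

for label, mbps in [('T1', 1.544), ('10Mbps', 10), ('100Mbps', 100)]:
    print(f"100GB over {label}: {replication_hours(100, mbps):.0f} hours")
# roughly 192, 30 and 3 hours -- in line with the figures above
```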
I personally like to use Wireshark to get data estimates; other good network monitoring tools are PacketTrap by Dell (Windows only) and ntop (*NIX only). Any of these tools will give you a nice graph showing how much data your synchronizations will use and whether your current WAN is optimized for it.
OK, now come the Tier 1 applications (see my earlier article on how to create a DR plan that covers tiering apps). One in particular I’d like to hit is Active Directory. I’ve heard a million times from people that it isn’t built into DR because they have X amount of AD servers everywhere, so it’s not an issue; that is, until we are talking automation. If we are failing over a site and want an AD server to have the same FSMO roles as in production, a standby AD server won’t be able to pick up those roles automatically the way a virtual machine can. It would require seizing roles and a metadata cleanup that is outside the scope of this article. The only real way to bring that server online from production is to have it virtualized and fail it over at the DR site.
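If you want to verify where a FSMO role actually lives before and after a failover, here is a minimal sketch using the Python ldap3 library to read the PDC emulator role owner from the domain naming context. The server name, account, and base DN are placeholders for your environment:

```python
from ldap3 import Server, Connection, ALL

# hypothetical DC and service account -- substitute your own
server = Server('dc01.example.com', get_info=ALL)
conn = Connection(server, user='svc-dr@example.com', password='changeme',
                  auto_bind=True)

# The domain naming context's fSMORoleOwner attribute points at the NTDS
# Settings object of the DC holding the PDC emulator role; the other four
# roles live on other well-known objects.
conn.search('DC=example,DC=com',
            '(objectClass=domainDNS)',
            attributes=['fSMORoleOwner'])
print(conn.entries[0].fSMORoleOwner)
```

Run the same check against the DR site after failover and the value should point at the recovered (virtualized) DC, not a stale production one.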
So let’s talk about some virtualization technologies and their features for automation at time of DR, probably the biggest and most popular being VMware with SRM (Site Recovery Manager)/HA. Below is an example of SRM and its features. Once you have the VMware plugin installed and configured (out of scope for this article), just click on Site Recovery from vCenter.
As you can see, the power-on options below, organized by priority groups, execute your automation for you. In this orchestrated sequence of events, VMware will automate your disaster recovery at your target site(s) and power on your servers by groups or by applications. There is also a test failover feature which writes all changes to a journal so you can continue testing your DR scenario; once done, it will blow away the journal and revert to the previous state.
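SRM drives this through its own UI and API, but as a rough approximation of the priority-group idea, here is a sketch using the pyVmomi library to power on VMs group by group against plain vCenter. The host, credentials, and VM names are made up, and this is not the SRM API itself:

```python
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# hypothetical DR vCenter and credentials
ctx = ssl._create_unverified_context()   # lab only; use real certs in production
si = SmartConnect(host='vcenter-dr.example.com',
                  user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ctx)
content = si.RetrieveContent()

# priority groups, highest first: infrastructure, then apps, then web
PRIORITY_GROUPS = [['dc01', 'sql01'], ['app01', 'app02'], ['web01']]

def find_vm(name):
    """Look up a VM by name anywhere under the root folder."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        return next((vm for vm in view.view if vm.name == name), None)
    finally:
        view.DestroyView()

for group in PRIORITY_GROUPS:
    tasks = [vm.PowerOnVM_Task() for vm in map(find_vm, group) if vm]
    # wait for the whole group to finish before starting the next one
    while any(t.info.state not in ('success', 'error') for t in tasks):
        time.sleep(2)

Disconnect(si)
```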
Next I’d like to talk about Citrix XenServer, which was pretty painless to configure and set up. Once you have your metadata synced between sites (the metadata is basically like vCenter’s database, keeping track of all your configurations and settings), the steps are:
1. On the primary pool, choose Pool -> Disaster Recovery -> Configure, and select an SR to store the configuration data on
2. Once the initial sync (seeding) is complete, set the DR LUN to read/write
3. On the DR pool, choose Pool -> Disaster Recovery -> Disaster Recovery Wizard -> Test Failover -> Next -> Next -> choose the mirrored LUN (now read/write) -> Next -> then pick your VMs on the ‘Select the vApps or individual virtual machines to fail over to the target pool’ window
XenServer will now attach the SR, create the VMs based upon the VM metadata previously saved, and start the VMs you selected.
Once all is done, press that failover button in the corner and you’re done. It doesn’t have as many features as SRM, but it gets the job done, and it doesn’t have the processor-architecture restrictions between servers that VMware has! With Citrix you can be running on AMD in production and fail over to a server with Intel flawlessly.
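For those who prefer scripting it, XenServer ships a Python binding (XenAPI). Here is a minimal sketch that logs in to the DR pool and starts every halted VM, assuming the DR wizard has already recreated the VMs from the synced metadata; the host and credentials are placeholders:

```python
import XenAPI

# hypothetical DR pool master and credentials
session = XenAPI.Session('https://xenserver-dr.example.com')
session.xenapi.login_with_password('root', 'changeme')
try:
    for ref, rec in session.xenapi.VM.get_all_records().items():
        # skip templates, snapshots, and dom0 itself
        if rec['is_a_template'] or rec['is_control_domain']:
            continue
        if rec['power_state'] == 'Halted':
            # start(vm, start_paused, force)
            session.xenapi.VM.start(ref, False, False)
finally:
    session.xenapi.session.logout()
```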
Well, what about all of the critical servers in the environment, mainly the UNIX ones? Let’s finish off part one with Solaris, then. Oracle Solaris (I feel weird saying that) has a built-in virtualization feature called containers, aka Zones. Oracle Solaris Cluster is the high-availability (HA) solution for Oracle Solaris; it offers close integration with Oracle Solaris Zones and extends Oracle Solaris 11 to provide a highly available infrastructure for deploying virtualized workloads.
Oracle Solaris Cluster provides two different types of configuration for Oracle Solaris Zones. Zone clusters extend the Oracle Solaris Zones model across multiple clustered nodes into a virtual cluster. This feature allows you to protect applications running within the zones through policy-based monitoring and failover, and it enables reliable operation of multitiered workloads in isolated “virtual” zone clusters. (For more information, see the Zone Clusters—How to Deploy Virtual Clusters and Why white paper.)
In addition to zone clusters, Oracle Solaris Cluster offers a means for protecting the zone itself: the failover zone. This zone is considered to be a black box, and it is monitored and controlled by the Oracle Solaris Cluster HA agent for zones, which starts, stops, and probes the zone. The agent also moves the zone between servers in the event of a failure or upon an on-demand request.
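Once the failover zone’s resource group exists, moving it on demand is essentially one command with the cluster CLI. Here is a hedged wrapper around Oracle Solaris Cluster’s clresourcegroup command; the resource-group and node names are invented for illustration:

```python
import subprocess

def switch_resource_group(rg_name, target_node):
    """Move a Solaris Cluster resource group (e.g. one holding a failover
    zone) to another node using the clresourcegroup CLI."""
    subprocess.run(['clresourcegroup', 'switch', '-n', target_node, rg_name],
                   check=True)

# hypothetical names: a zone resource group and the second cluster node
switch_resource_group('zone-rg', 'node2')
```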
This article describes how to set up a failover zone on a two-node cluster. For more details, check the Oracle Solaris Cluster Software Installation Guide. If you would like to see how to configure the actual failover, check out this article, which will walk you through it.
Well, I hope you liked part 1. In the next article we can tackle a little more virtualization, plus Linux and physical recoveries. Stay tuned.