Step 1. Understanding what is involved in a Disaster Recovery Plan
I’m sure your first guess would be that recovering servers to their original state prior to the disaster is the correct answer. That makes sense, but it’s actually much more intricate and involved than that, and depending on the size and complexity of the organization things can get much trickier. Disaster recovery is really about bringing back the applications that the business uses to make money and having the data for those applications be as close as possible to where it was prior to the disaster.
I’d also like to clarify what “the business” is, as it’s referred to above. The business is the part of the company that pays for the servers and salaries, and it is what you need to care about when dealing with DR. It is very beneficial to step outside your normal role for a moment to understand what the business expects to see after a failure and how to set expectations and budgets around that idea. Most of the time the auditors or compliance officers (SOX, HIPAA or the insurance company) will make your company do a business impact analysis, which can set your course in DR. The “BIA” will show how much money is lost by the minute and which applications make that money. Make great note of this and work backwards from there in setting up recovery plans.
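To make the idea of working backwards from the BIA concrete, here is a minimal sketch in Python. It assumes the BIA gives you a loss-per-minute figure for each application; the application names and dollar amounts are made up for illustration and are not from any real analysis.

```python
from dataclasses import dataclass


@dataclass
class AppImpact:
    name: str
    loss_per_minute: float  # dollars lost per minute of downtime, taken from the BIA


def rank_by_impact(apps):
    """Return applications ordered from most to least costly per minute of downtime."""
    return sorted(apps, key=lambda a: a.loss_per_minute, reverse=True)


def outage_cost(app, minutes_down):
    """Estimated loss for one application over an outage of the given length."""
    return app.loss_per_minute * minutes_down


# Hypothetical figures for illustration only.
apps = [
    AppImpact("payroll", 150.0),
    AppImpact("order-entry", 900.0),
    AppImpact("intranet-wiki", 5.0),
]

for app in rank_by_impact(apps):
    print(f"{app.name}: ${outage_cost(app, 60):,.2f} per hour of downtime")
```

The ranked list is only a starting point for the recovery order. The business still has the final say on what comes back first.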
Step 2. Understanding what makes up an application
Let’s start with what it’s not, and that is a group of servers. I’ve seen it countless times: the team has recovered the “servers,” the actual users can’t log into the application, and you hear “hey, the servers are back, what more do you want from me?” While the IT team is patting themselves on the back because the servers rebooted, everyone else is sharpening their knives. I’m here to try and steer you away from that pitfall and set the course for success.
Start by thinking of an application as an IT ecosystem (a system formed by the interaction of a community of organisms with their environment). Just as in nature there are a multitude of different ecosystems that make up a certain environment, the same holds true within an application. There are multiple lines of business that use a single application, and each one probably has a third party plugin or similar that makes it unique to them. Ecosystem ecology puts all of this together and tries to understand how the system operates as a whole, which means focusing on the major functional aspects of the system. Do the same for applications. You have hardware, an operating system, middleware and an application, and then the fine pieces like plugins and batch jobs that make them all connect.
A single application can be made up of 100-plus servers, and sometimes you need to play detective to figure them all out. This is also where having allies in the other departments can help you out. How often have you seen a situation where a sales guy downloaded a third party app on his laptop that feeds all of the company reports, or where that server that’s been in development for two years really is a production server? That is why it’s always good to hold the high ground and be in a state of readiness. There are always special company factors that limit abilities and hamper collaboration, but how you handle them will be your key to victory.
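One way to keep track of the detective work is to write the dependencies down as a simple map and walk it. The sketch below is in Python and assumes you maintain that map by hand. Every component name in it is hypothetical.

```python
def components_for(app, depends_on):
    """Collect everything an application relies on, directly or indirectly."""
    seen = {app}
    stack = [app]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, []):
            if dep not in seen:  # also protects against circular dependencies
                seen.add(dep)
                stack.append(dep)
    return seen - {app}


# Hypothetical inventory: each key depends on the items in its list.
depends_on = {
    "billing-app": ["app-server-01", "report-plugin"],
    "app-server-01": ["sql-cluster", "fax-gateway"],
    "report-plugin": ["sales-laptop-export"],
    "sql-cluster": ["san-volume-7"],
}

print(sorted(components_for("billing-app", depends_on)))
```

Anything that shows up in the result but not in your recovery plan, like the laptop export in this made-up example, is exactly the kind of surprise you want to find before the disaster.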
Step 3. Planning and Teamwork
As mentioned above, the BIA is key for the business, but if your company is smaller or doesn’t require one then it’s up to the IT department to figure out what the business needs. You may not be able to figure out the financial impact of the applications being unavailable, but you can work back by deciding how your company makes money. Just sit back and imagine a giant sinkhole just swallowed your whole company. Whether that makes you smile or shudder in terror is irrelevant. What is going to happen now? This is where stepping outside the box really starts to take place, because you need to go to work, work just disappeared, and it’s up to you to get it back.
Let’s start with the easiest question: where do paychecks come from? This is usually everybody’s first concern, so start with the accounting department’s servers and applications. I really don’t see anyone having a problem with that. Nobody is going to work for free or help bring back the business if nobody is getting paid; it’s that plain and simple. Next, tackle how the company makes money and bills for its services, and start to prioritize your applications from there, because they can’t pay you with money they don’t have. This is working backwards from the time of the disaster, and it is how you start training your mind to be “DR centric.”
Great, you’ve spent a couple of days prioritizing and tiering the applications that will save your company and business. Now what? A great way to develop your Disaster Recovery plan is to work collaboratively with the stakeholders and department leads of your business. Why, you ask, when they have no cares regarding the IT guys? Because if anyone understands what makes a business run and how it gets paid, it’s the people in charge of the applications on the business side. Gather your findings into a PowerPoint presentation (because oh, how they love their PowerPoints), and keep it high level, centered on the applications you believe are the most important.
The reason you should keep to a high level, or what I like to call a 30,000 foot view, is to keep them engaged, because they can then collaborate and start pulling out some third party apps that you may never have been aware of. This is the first step in working with the business to create a recovery plan or objective. It’s a great starting point to work from, and you have now caught the interest of the “business people” and got them thinking more DR centric as well.
Why is that important, you ask? Two words: IT budget. They now have some skin in the game, and when you start working backwards some more into how to actually recover these apps, they are already invested because they helped figure out what needs to be done. Don’t get me wrong, it won’t guarantee you an immediate high end failover site or new facility, but people will be much more willing to help you come up with a solution now than they were before, when you asked randomly for new equipment. Collaboration is key with DR, and you will make new friends in the process.
Step 4. Standardizing the recovery
The next thing to do is standardize your DR goals. We in the field work with accepted definitions and standards of application recovery, which are RTO, RPO and RTA. These give us a standard and a goal to attain, and they level set DR with the business. These three acronyms set the stage for forecasting, budgets, and accepted levels of loss. Without these definitions and an organized application recovery list you can be lost.
- An RTO is a recovery time objective: the amount of time in which you are expected to recover the application. If the server goes down on Tuesday at 8 PM, when do you expect that application to be back and functioning?
- An RPO is a recovery point objective: the point in time that the application’s data is expected to be recovered to. So if the above mentioned server goes down at 8 PM and the company wants the data on that server to be from 7:50 PM, those 10 minutes are the “accepted amount of loss.”
- An RTA is a recovery time actual: the actual time it took to recover the application in the past. Whether it was a DR test or an actual disaster, someone kept the metric of how long it took to bring it back online.
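To show how the three numbers relate, here is a small Python sketch that takes the timestamps from a test or a real outage and compares the measured recovery time (the RTA) and data loss against the objectives. The 8 PM outage and 7:50 PM data point come from the example above. The 4 hour RTO and the 11:30 PM restore time are assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def recovery_report(outage_start, service_restored, last_good_data, rto, rpo):
    """Compare what actually happened in a recovery against the agreed objectives."""
    rta = service_restored - outage_start      # how long the recovery really took
    data_loss = outage_start - last_good_data  # how much data was really lost
    return {
        "rta": rta,
        "data_loss": data_loss,
        "rto_met": rta <= rto,
        "rpo_met": data_loss <= rpo,
    }


report = recovery_report(
    outage_start=datetime(2024, 1, 2, 20, 0),       # Tuesday, 8 PM
    service_restored=datetime(2024, 1, 2, 23, 30),  # hypothetical restore time
    last_good_data=datetime(2024, 1, 2, 19, 50),    # data as of 7:50 PM
    rto=timedelta(hours=4),                         # hypothetical objective
    rpo=timedelta(minutes=10),
)
print(report)
```

Keeping a record like this from every DR test gives you real RTA numbers to bring to the business when the objectives are being set or revisited.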
With the above and the blessing of the business, we can now start to wade our way into standardization and into setting goals and objectives. This is where we need to start in order to be successful in DR.
An important formula in DR: the higher the $ spent, the faster the RTO.
Step 5. Validating the DR
So now that we have executed on our goals, the question becomes: how do we validate that the recovery was successful? Two words: user acceptance. Before we look into what users expect, let’s take a minute to jump outside the box again and look at application validation prior to turning over control to the business. Have we
What should you be looking for to validate that an application has been recovered? Let’s start with this: can the application still function like it did before? Besides the obvious vital signs (the application actually starts, the services are running and the login is successful), can it still send fax invoices and upload data to the hosted applications you use for payroll and the like? Some of those functions are the last step in validating the application, and they depend on things like Brooktrout modems, printers to complete an order, or a fax POTS line. Can the DMZ facing applications report to or update the internal servers?
Once those processes are verified we can look at user acceptance. User acceptance is when the actual day to day users of the applications and the data entry folks can log in and process real data. Make sure that your VPN is working and able to accommodate remote users to do the testing. If the sinkhole experience really played out, it’s very unlikely that your office will still be there and that people will have a desk to sit at.
With these steps complete and the business once again able to make money, we can consider the DR complete.
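The questions above lend themselves to a checklist that you run before handing the application over for user acceptance. Here is a minimal Python sketch of that idea. The check names mirror the examples in the text, and the lambdas are placeholders that stand in for whatever real tests you would write for your own environment.

```python
def run_checks(checks):
    """Run each named check and return the names of the ones that failed."""
    failed = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises an error counts as a failure
        print(f"[{'PASS' if ok else 'FAIL'}] {name}")
        if not ok:
            failed.append(name)
    return failed


# Placeholder checks: replace each lambda with a real test for your environment.
checks = [
    ("application starts", lambda: True),
    ("services are running", lambda: True),
    ("test login succeeds", lambda: True),
    ("fax invoice can be sent", lambda: False),  # simulated failure
    ("DMZ app can update internal servers", lambda: True),
]

failed = run_checks(checks)
ready_for_user_acceptance = not failed
print("Ready for user acceptance:", ready_for_user_acceptance)
```

In this made-up run the servers are up and logins work, but the fax check fails, so the application is not ready to hand over yet.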