Create a Backup Design
The image above can be a backup administrator's worst fear: as the sun starts to rise on your building, it will soon be full of light and people. Why is this bad, you ask? Most of us in the IT world know that the backups running overnight aren't always finished, and to avoid performance issues they sometimes get cut off early. IT folks call this design The Race to Sunshine: a backup schedule that runs all night into the next morning, where you are literally racing the clock to back up as much as possible before the sun comes up and you have to kill the jobs.
Let's face it, there are a LOT of people who assume that most of the data got backed up and that whatever was lost is an acceptable amount to live with. The problem is that you don't actually know what you are not backing up. When disaster recovery time comes, some of us think the data is back, or most of it, so we should be good and the job is done. But the rest of the company relies on the applications on the servers you just stood up, and one small file that didn't get backed up from a USER folder could be the piece of data that feeds a plugin the application depends on, making it fail. See my article on applications and recovery planning. You would be shocked and horrified by the kinds of nuances that can crash an application or render it useless.
Another misconception on backups is databases. Databases are usually the most critical piece of the computing puzzle and the application's lifeline. Without them you can consider everything a failure, even if you recovered 99.9% of your environment, because that 99.9% is supported by the database. There are many ways to skin a DB backup, and the right one depends entirely on your environment, but I'd like to focus on the issue at hand, which is the misconceptions around the backup process. One of the major issues is open files: the database files and transaction logs that are actively being written get pushed to the end of that server's backup, or skipped entirely, because the software is intelligent enough to wait for a file being written to finish. If the backup cuts short, that data is gone.
What if you have a plugin or a DB snapshot utility? Then you shouldn't have to worry, right? Well, here's the truth on those: a snapshot may capture one DB at a moment in time that is out of sync with another DB, or with application data that is behind where your backup is. This can happen when the server where the application runs holds cached data that is not in tune with the DB. You end up recovering one piece of the puzzle and having to go back to an earlier backup date to finish the other pieces. That means at best you can recover to the RPO of your last good backup, and at worst not at all. Whew, that was a mouthful.
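To make that concrete, here is a minimal sketch of the kind of cross-check I mean: compare the point-in-time of each component's backup and flag anything that drifted too far from the rest. The component names, timestamps, and the 30-minute threshold are all invented for illustration; the idea is simply that your recovery points should be audited as a set, not one at a time.

```python
from datetime import datetime, timedelta

# Hypothetical completion times pulled from each component's backup report.
# Names, times, and the threshold are all made up for illustration.
recovery_points = {
    "orders_db_snapshot":   datetime(2015, 6, 1, 2, 15),
    "billing_db_snapshot":  datetime(2015, 6, 1, 3, 40),
    "app_server_file_dump": datetime(2015, 6, 1, 1, 5),
}

MAX_DRIFT = timedelta(minutes=30)  # how far apart is still restorable as a set

newest = max(recovery_points.values())
for name, taken_at in sorted(recovery_points.items()):
    drift = newest - taken_at
    flag = "OUT OF SYNC" if drift > MAX_DRIFT else "ok"
    print(f"{name:22s} {taken_at:%Y-%m-%d %H:%M}  drift={drift}  {flag}")
```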
Cluster and failover solutions are another misconception in the IT world, mainly because database clusters need to be in the same location and failovers need a nice-sized pipe for live replication, all of which is meaningless if a virus breaks out or your security is compromised in some way. Databases are the first targets, and if you don't know when or how you were compromised in the first place, recovery becomes even more complex and strenuous, because if you restore a compromised account your problem will never go away. And don't get me started on antivirus servers crashing themselves and causing recovery problems for operating systems; that's a possible future rant/blog in itself.
Lastly on the negatives: the actual truth is that when you take shortcuts, they come back to bite you ten times harder when you're recovering. For example, backups that run multiplexed and multi-streamed finish quicker because they throw data at a group of tapes and don't care where it lands. When you try to recover systems, the data that was sprayed across those tapes starts experiencing contention with data from other servers trying to restore at the same time. A tape drive needs to stop, rewind, and find the data it will restore, and when multiple systems are doing this at the same time you can experience a crawl. This also holds true for VTLs, don't be fooled.
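A back-of-the-napkin model shows why the crawl happens. Every number below is invented for illustration (no real drive was measured): the point is that each switch between interleaved streams costs a fixed reposition delay, and those delays dominate the restore once several servers share a tape set.

```python
# Toy model of multiplexed tape restore contention. All numbers are
# invented for illustration, not measured from any real drive.

READ_MB_PER_SEC = 120   # sustained streaming read speed
REPOSITION_SEC = 45     # cost to stop, rewind, and seek between blocks
CHUNK_MB = 256          # size of each interleaved block on tape

def restore_hours(total_mb: float, interleaved_servers: int) -> float:
    """Estimate restore time for one server's data on a shared tape set."""
    chunks = total_mb / CHUNK_MB
    # With N servers interleaved, roughly every chunk boundary forces a
    # reposition past the other servers' blocks.
    seek_time = chunks * REPOSITION_SEC * max(interleaved_servers - 1, 0)
    read_time = total_mb / READ_MB_PER_SEC
    return (read_time + seek_time) / 3600

for n in (1, 4, 8):
    print(f"{n} interleaved servers: ~{restore_hours(500_000, n):.1f} h "
          f"to restore 500 GB")
```

Under these made-up numbers, one dedicated stream restores 500 GB in about an hour, while eight interleaved servers sharing the tapes push it past a week of drive time. The exact figures are fiction; the shape of the curve is not.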
Visualize sitting on your couch when someone knocks on your door (just like a data request): you need to get up, open the door, acknowledge it, and then go back to the couch. Now imagine doing this for 100-plus people knocking at your door simultaneously, and you can start to understand how you would get tired and slow down after a while, right?
Rule of thumb: "However long it takes to back up, multiply by 2 to recover it." An eight-hour backup window, for example, means planning for roughly a sixteen-hour restore.
So where to go from here, in 10 steps:
The first step in beating daylight for your backups is assessing the situation: analyze what you already have, then work through these tips and tricks.
- Look at the backup server logs and see what backed up, what didn't, and why. You can tell whether a failure was a transaction log from an important database, for example, or just a user's Word doc on a file server. From there you can decide which issue takes precedence and move on. (A sketch of this kind of log triage follows after this list.)
- Put together an issue-and-resolution spreadsheet to show the higher-ups what the issues are, how you flagged them, and how you plan to move forward. This may get some things cleared off your plate, or even win some much-needed budget approvals depending on how dire the situation is.
- Next, look at your vendor's best practices for backup and recovery. Vendors publish a wealth of great information online, as well as tools (capacity planning, for example), so you can check whether you're doing it right. Plus, if things still aren't going correctly, you can leverage them for help, since you followed their best practices.
- Make sure your backup clients are the EXACT same version as your backup server. This is a huge cause of failure, especially for Windows system states: if the server is version 8.8.10, for example, your client had better be 8.8.10 as well. Mismatched versions rarely restore properly, and your vendor will tell you the same thing. (A quick version audit is sketched after this list.)
- Always dump the system state natively from Microsoft to a secondary drive like D: and then let the regular backup sweep over it. The same goes for Solaris and the UNIX systems that have native utilities like FLAR and mksysb. It takes less backup server time, and trust me, an OS can back itself up and recover itself better than a third party can. (See the system state sketch after this list.)
- Know what you are licensed to do and what you have available. Check whether your backup license allows things you didn't know about, like running a VTL. I find that most of the time a company has resources at its disposal it never knew existed, sometimes even storage that can be leveraged in multiple ways.
- Set periodic database jobs (RMAN, for example) to dump their backups and logs to a folder on the network that backs up quickly, so that if your traditional backup runs into any issues you can do a traditional build and recover from the dumps. This method will get you over lots of hurdles with synchronization issues and locked files during traditional backups. (A small RMAN dump driver is sketched after this list.)
- Define your backup policy so your highest-priority servers are backed up to dedicated tapes. For example, if you have 10 highly critical servers, make sure they go to 10 defined tapes/drives, then layer the next 10 servers, in order of importance, across the same tape/disk pool. When you are restoring after a huge loss, the first 10 servers get built and recovered with zero contention while the other servers are being built for the next wave of recoveries. (See the priority-mapping sketch after this list.)
- Tier your disk storage. If your high-priority backups are running to a development pool of SATA disks, chances are you're not getting reliable response times or speeding up your backup. Whether you have SAN, NAS, JBOD, etc., make sure you know the speeds those disk pools can read and write at; try the disk benchmark utilities that are widely available from all the HW vendors. (A crude do-it-yourself benchmark follows after this list.)
- Create multiple backup jobs that write to the newly tiered system. Now that we have multithreaded our backups and made new assignments, having multiple jobs is key to seeing what's being backed up and how much speed you've gained. You can calculate your window far better this way to beat the sunshine, and it shows you points of failure so you can decide how to fix them. (The last sketch after this list times jobs run in parallel.)
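To close out, here are a few sketches for the tips above, all in Python. First, the log review: this assumes a made-up one-line-per-file log format (real products like NetBackup or NetWorker have their own layouts and query tools), and it simply sorts failures into database artifacts versus user files so you know what to chase first.

```python
import re
from collections import Counter

# Hypothetical one-line-per-file log format; adjust the pattern to whatever
# your backup product actually writes.
LINE = re.compile(r"^(?P<status>OK|FAILED|SKIPPED)\s+(?P<host>\S+)\s+(?P<path>.+)$")

sample_log = """\
OK      fileserver01  /shares/users/bob/notes.docx
FAILED  dbserver01    /oracle/redo/redo01.log
SKIPPED dbserver01    /oracle/data/orders01.dbf
FAILED  fileserver01  /shares/users/alice/draft.docx
"""

problems = Counter()
for line in sample_log.splitlines():
    m = LINE.match(line)
    if not m or m["status"] == "OK":
        continue
    # Crude triage: database files and logs outrank user documents.
    kind = "database file/log" if re.search(r"\.(log|dbf|mdf|ldf)$", m["path"]) else "user file"
    problems[(m["host"], kind)] += 1
    print(f"{m['status']:8s} {m['host']:13s} {kind:18s} {m['path']}")

print("\ntriage counts:", dict(problems))
```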
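For the version-matching tip, a sketch that assumes you can gather client agent versions into a simple mapping; how you collect them (your product's CLI, an inventory tool, your CMDB) will vary.

```python
# Hypothetical inventory of backup agent versions; in practice you would
# pull these from your backup product's CLI or your CMDB.
SERVER_VERSION = "8.8.10"
client_versions = {
    "fileserver01": "8.8.10",
    "dbserver01":   "8.8.7",   # mismatch: system state restores may fail
    "appserver02":  "8.8.10",
}

mismatched = {h: v for h, v in client_versions.items() if v != SERVER_VERSION}
for host, version in mismatched.items():
    print(f"UPGRADE NEEDED: {host} runs {version}, server is {SERVER_VERSION}")
```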
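For the native system state dump on Windows, `wbadmin start systemstatebackup` is the built-in Windows Server Backup command; the D: target and the idea of wrapping it in a script are assumptions to adapt (Solaris and AIX shops would swap in flarcreate or mksysb).

```python
import subprocess

def dump_system_state(target_drive: str = "D:") -> None:
    """Dump Windows system state to a secondary drive so the nightly file
    backup can sweep the dump up like any other data. Needs an elevated
    prompt and the Windows Server Backup feature installed."""
    cmd = ["wbadmin", "start", "systemstatebackup",
           f"-backupTarget:{target_drive}", "-quiet"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"system state dump failed: {result.stderr}")

if __name__ == "__main__":
    dump_system_state()  # schedule nightly, ahead of the regular backup window
```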
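For the periodic database dump, a sketch that drives RMAN from a script. `BACKUP DATABASE ... PLUS ARCHIVELOG` is standard RMAN syntax; the dump path, and connecting as `target /` with OS authentication on the database host, are assumptions for your environment.

```python
import subprocess

# RMAN commands to dump the database plus archived logs to a fast network
# share that the traditional backup then picks up. The path is an assumption.
RMAN_SCRIPT = """
RUN {
  BACKUP DATABASE FORMAT '/backup_dumps/orders/%d_%U.bkp'
    PLUS ARCHIVELOG;
}
EXIT;
"""

def run_rman_dump() -> None:
    # 'rman target /' relies on OS authentication on the database host.
    result = subprocess.run(["rman", "target", "/"],
                            input=RMAN_SCRIPT, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"RMAN dump failed:\n{result.stdout}")

if __name__ == "__main__":
    run_rman_dump()  # run from cron / Task Scheduler at whatever cadence you need
```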
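For the tape policy tip, a sketch of the tiered mapping itself; server and drive names are invented, since the point is the assignment pattern, not the inventory.

```python
# Invented server and drive names; the point is the assignment pattern.
critical    = [f"crit{i:02d}" for i in range(1, 11)]   # top 10 servers
second_tier = [f"app{i:02d}" for i in range(1, 11)]    # next 10 by importance
drives      = [f"DRIVE{i:02d}" for i in range(1, 11)]  # 10 tape drives/pools

# Tier 1: one dedicated drive per critical server -> zero restore contention.
policy = {srv: drv for srv, drv in zip(critical, drives)}

# Tier 2: stripe the next wave across the same pool; these restore after
# tier 1 finishes, so sharing drives is acceptable.
for i, srv in enumerate(second_tier):
    policy[srv] = drives[i % len(drives)]

for srv, drv in policy.items():
    print(f"{srv} -> {drv}")
```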
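For storage tiering, your HW vendor's benchmark utilities are the right answer, but a crude sanity check is easy to script. This times a sequential write and read of a scratch file; the 1 GB size and target path are assumptions, and OS caching makes the read number a rough indicator at best.

```python
import os
import time

def crude_disk_benchmark(path: str, size_mb: int = 1024) -> None:
    """Time a sequential write then read of a scratch file. Rough numbers
    only: OS caching can flatter the read unless the file is large."""
    block = os.urandom(1024 * 1024)  # 1 MB of random data
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1024 * 1024):
            pass
    read_s = time.perf_counter() - start
    os.remove(path)
    print(f"write: {size_mb / write_s:.0f} MB/s, read: {size_mb / read_s:.0f} MB/s")

crude_disk_benchmark("/backup_pool/scratch.bin")  # target path is an assumption
```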
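Finally, for the multiple-jobs tip, a sketch that times jobs run in parallel so you can see the gain per tier. The job names and the `run_backup_job` stub are placeholders for however your product actually kicks off a job (CLI call, API, etc.).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_backup_job(job: str) -> tuple[str, float]:
    """Placeholder: swap in your backup product's CLI or API call here."""
    start = time.perf_counter()
    time.sleep(1)  # stand-in for the real job's runtime
    return job, time.perf_counter() - start

# Hypothetical jobs aimed at the newly tiered storage.
jobs = ["tier1_db_pool", "tier1_app_pool", "tier2_file_pool", "tier3_archive"]

wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    for name, seconds in pool.map(run_backup_job, jobs):
        print(f"{name}: {seconds:.1f}s")
print(f"total wall clock: {time.perf_counter() - wall_start:.1f}s")
```

Comparing each job's time against the total wall clock tells you how much the parallelism and the tiering actually bought you, and which job is the straggler to fix next.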
I hope this gives everyone a good starting point for backup design. In the future I plan to deep-dive on some of these points, but this should put you on the right path to recovery.