Here’s the thing. I have been working in ICT for too long now, and in my time I have seen that a solid backup routine can be the difference between a company closing its doors and a couple of emails about why the server is running slow this morning.
I am technology agnostic. I have seen some really good products out there that do everything they need to, and some really bad products out there that only form part of the solution. Please don’t read this document looking for a solution that’s right for you. Read it with a view to being able to pick solutions that will work for you.
“Oh ICT with their acronyms. tis an embuggerance of a thing.” — Me, this morning
Terms in ICT are there to give people shortcuts to explain commonly known concepts to each other. Unfortunately, in an effort to position products to people who don’t fully understand ICT (and that’s most of us, since the knowledge domain grows daily), companies have been calling their products things that sound technical but are really brand names for features. This glossary, like me, will not be using brands.
- Archive: A repository of data you have to keep for legal or sentimental reasons.
- Author: John Matthews of Champions of Change. email: john (at) champions.tech
- Backup: An overarching term for all of the technologies used to keep your data safe.
- BC: (Not Before Common Time) Business Continuity. What do you need to keep the business running? Venues, Tools, Services.
- Bin: A repository of data you should destroy for legal or sentimental reasons.
- Champions of Change: My current employer. The beatings with the scented shoelaces do get annoying, but the coffee is good.
- Cluster: A group of devices that do the same job to spread the load or to swap over in case of failure.
- Documentation: It’s very sexy to have everything electronic. It’s not so useful when you don’t have power. Have a copy of instructions in printed form Off site.
- DR: Disaster Recovery. ARG!!! The site has gone down and now we need to build everything from scratch. How long will it take to get working again, and what do we do to make it happen?
- Five Nines: A high standard of reliability. It means the device (or collection of devices) will work 99.999% of the time.
- Full Backup: A complete replica of the data.
- Incremental Backup: A partial replica of the data, usually covering any changes between the last backup and Now(). (Strictly, a partial backup that covers everything since the last Full Backup is called a differential backup.)
- NAS: Network Attached Storage. A big pile of hard drives that thinks it’s a file server.
- Off site: A place where your place of business isn’t. Used for storage of things just in case “bad things”(TM) happen at the office.
- RAID (0, 1, 5, 6, 10): A method of spreading or copying data across the hard drives on a computer. Every level except RAID 0 ensures that the data won’t be lost if one drive dies. RAID 0 only spreads the load and protects nothing.
- Recovery: Finding out what is left after the disaster and getting it operational.
- Replication: The gentle art of copying from hither to yon.
- ROI: Return On Investment. You spend money on stuff. You want to make it back. Straight up, Backup won’t make you money, the same way insurance won’t make you money. But in the event of “Bad things”(TM) it will save you money. So don’t let the sales droids tell you otherwise.
- Rollback: Restoring data to how it was at a certain point in time.
- SAN: Storage Area Network. A big pile of hard drives that can be connected to one or more computer servers.
- Snapshot: A copy of all your data as it stood at a certain point in time.
- SOL: Something outta Luck. I cannot for the life of me remember the first word.
- Tape Storage: Lots of data needs to be moved onto media quickly. Tape drives have traditionally done this job well. They are basically a long strip of magnetic media on a spool. There is a whole infrastructure for backup and restore from tape such as tape libraries, Loaders, and readers.
- Workstation Backup: A stop-gap solution to having corporate data on a workstation rather than on a central server. It copies the data from the workstation to somewhere else to keep it safe. I have a low opinion of workstation backup systems for anything larger than 5 workstations.
How to spot a good working backup system from quite a long way away.
“Data wants to be anthropomorphised.” — Some guy on the internet.
So you have been told a backup system is a “good thing”(TM) and want to put one together. How do we go about it? The first step is to plan, and the first rule of planning is to be realistic and honest. Here are some handy questions to ask.
What is important to me?
If you were running from a burning house and could only take one thing, what would it be? Normally it would be “People, the photos, and the important documents.” or a variation of these. Armed with these three things, you can rebuild your past and present, so you can go into the future.
The same is important for a business. They need that corporate knowledge and corporate memory to rebuild. They need the information to know how to get onto the insurance company and what they should let burn because it isn’t worth the effort of saving.
Make a list.
How long can I survive without a tool?
If you ask a builder how he would get on building without his toolbox, the answer would be “I’d need to run down to the hardware store to pick up a hammer, saw and chisel.” But he wouldn’t be much of a builder if he didn’t carry a spare of each of his most used tools.
If I lose my workstation, I can, with some inconvenience, shift over to my phone for a lot of tasks. I have organized my data to be available if I log on from an internet cafe. My needs are simple. What are your company’s needs?
How long can I survive without this data?
It may cost you more money to lose an hour without data than to buy a whole second set of infrastructure. In which case, sit down and work out how much you are going to lose if you lose any given piece of equipment. Also factor in set up costs.
NOTE: Having a cluster of two or more devices may make you feel secure, but it will up the costs in setup and maintenance. Be aware that there may be some false economy in doubling up on parts. You may be better served by having two of a device and keeping one in the back room, ready to put into place if the first one packs in.
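The sums above can be sketched in a few lines of Python. Every figure here is made up for illustration and the function names are hypothetical. Plug in your own numbers for hours down, money lost per hour, and the price and set up cost of a spare.

```python
def downtime_cost(hours_down, loss_per_hour):
    """Money lost while a piece of equipment is out of action."""
    return hours_down * loss_per_hour


def cheaper_to_duplicate(hours_down, loss_per_hour, spare_price, setup_cost):
    """True if a spare (purchase plus set up) costs less than the outage."""
    return spare_price + setup_cost < downtime_cost(hours_down, loss_per_hour)


# Assumed figures: a server that takes 16 working hours to replace,
# a business that loses 500 an hour without it, and a spare that
# costs 4000 to buy plus 1000 to set up.
print(downtime_cost(16, 500))                     # 8000
print(cheaper_to_duplicate(16, 500, 4000, 1000))  # True
```

With these assumed numbers the spare pays for itself after one outage. If the outage only cost you an hour, it would not.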
What sort of communication strategy should I have? Who needs to know what?
People will find out. It’s your choice to look honest or shifty. You won’t be able to make it look good. Stop trying. Tell those that need to know early and if at all possible, give them a time to full operation. This will be the time it takes to do the task. Practice the recovery task once in a while. Practice fail over once in a while. If you don’t have metrics on it, odds are you have not practiced for it.
Do you really want the first time you do the process to be for real?
Building your first backup system.
“Can We build it? Mmm, yeah I think so.” — Lofty from Bob the builder
Now you have your specs, let’s put something together to meet them. An ideal backup system holds your data in at least
- three places
- across two media types
- with a single reference point to get it all back.
You should be able to use it to retrieve
- a single change to a single file from the past
- a single file from the past
- all of the files from the past.
Why? Because there are two things against you. The first is people who don’t want you to have access to your data. This might be industrial espionage, some script kiddy in Kerblakistan or an angry (soon to be ex) employee.
The second is entropy. Don’t pretend that the universe, luck, or even your fat fingers are on your side. Things will go wrong. Accept this and guard against the most likely things that can go wrong. (Buildings burn down often. The Zombie Invasion hardly ever happens. Work out which are the threats to you.)
So, for the laptop that I’m typing on right now, I have a copy of this file saved to my internal hard drive. I have a copy saved to a cloud provider, and I have a copy saved to a pocket USB stick.
This means that I have the file in three different places. (More, since my cloud provider saves it in multiple data centres.) I have it on three media types. (My HDD, whatever the cloud saves it as, and a USB stick.) Should I lose the file, I can grab one of the two remaining copies and be back typing in minutes from a new device.
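If you want to check a setup against the three places, two media types rule, something like the following would do it. This is a minimal sketch: `BackupCopy` and `meets_rule` are names I have invented for the example, and the entries simply describe the laptop setup above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    place: str   # where the copy physically lives
    medium: str  # what kind of media it sits on


def meets_rule(copies, min_places=3, min_media=2):
    """Check a set of copies against the three places, two media types rule."""
    places = {c.place for c in copies}
    media = {c.medium for c in copies}
    return len(places) >= min_places and len(media) >= min_media


laptop_setup = [
    BackupCopy(place="laptop", medium="internal hard drive"),
    BackupCopy(place="cloud provider", medium="cloud storage"),
    BackupCopy(place="pocket", medium="USB stick"),
]
print(meets_rule(laptop_setup))      # True
print(meets_rule(laptop_setup[:2]))  # False, only two places
```

It does not check the single reference point for getting everything back. That part still needs a human and a written procedure.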
This system works well assuming… that nothing corrupts my data during a save to my three locations.
So let’s pretend that a “Bad Thing”(TM) happens. My original file becomes corrupt.
I go to one of my backups. That’s corrupt too.
I hope like hell that the third one is not corrupted. But because I saved to all three locations at the same time I am SOL. This is where I need versioning or snapshotting.
Fortunately, my cloud provider offers a service where it remembers all changes to my document for the past thirty days. Hooray. I can go back to yesterday’s copy and I’ve only lost 24 hours’ work. Not ideal, but better than losing the lot. I can convince my boss that I need another 24 hours to complete the project. It would be a harder sell to tell her that I need to start again from scratch.
On most document archiving systems, this feature is available. Use it. It also allows you to maintain a rolling log of what the document is up to.
In most code repositories this is available and useful. Again, it allows you to log what is going on with a file or set of files and roll back as required.
In most file stores such as SANs or NASs, there is a function called snapshotting that, while using up a little more disk space, will do this for ALL of the files on them. Definitely worthwhile doing.
Failing all else, go old school. Copy all of the data that is important to you to another medium (USB Hard drives aren’t that expensive, Tape drives take a bit of setting up but are worth the effort for data centres.) Once the copy is made, you can go back to that copy as your versioning.
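The old school approach can be sketched with nothing but the Python standard library. This is one possible way of doing it, not a product recommendation: `copy_as_version` and `list_versions` are names invented for the example, and it assumes the data you care about lives in one folder and the other medium is mounted as a folder too.

```python
import shutil
from datetime import datetime
from pathlib import Path


def copy_as_version(source, backup_root, now=None):
    """Copy the source folder into a new dated folder under backup_root."""
    now = now or datetime.now()
    target = Path(backup_root) / now.strftime("%Y-%m-%d_%H%M%S")
    # copytree refuses to write into an existing folder, so an old
    # version can never be silently replaced by a newer one.
    shutil.copytree(source, target)
    return target


def list_versions(backup_root):
    """Return the dated folders, oldest first."""
    return sorted(p for p in Path(backup_root).iterdir() if p.is_dir())
```

Run `copy_as_version` once a day against your documents folder and each dated folder becomes a point you can roll back to. It uses a lot more space than real snapshotting, since every version is a full copy.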
Snake Oil, and how to avoid it.
“You can fool all the people some of the time, and some of the people all the time, but you cannot fool all the people all the time.” –Abraham Lincoln (allegedly)
All systems need human interaction. If you aren’t checking the receipts and logs whenever the job runs, then you won’t know when the system breaks. This is your responsibility since you have the most riding on it. If you do outsource this to someone, ensure your metrics for success are being met.
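One check that is easy to automate is whether the newest backup is suspiciously old. The sketch below assumes the backup job drops its files into a single folder, and the 26 hour limit is my own guess at a sensible threshold for a nightly job. The function names are hypothetical. It does not replace reading the logs; it only tells you when to go and read them.

```python
import time
from pathlib import Path


def newest_backup_age_hours(backup_dir, now=None):
    """Hours since the most recent file in backup_dir was written.

    Returns None if the folder holds no files at all.
    """
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600


def backup_is_stale(backup_dir, max_age_hours=26, now=None):
    """True if there is no backup, or the newest one is older than allowed."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours
```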
Workstation backup for more than 5 workstations is inconvenient at best and delusional at worst. Stick your data in the middle, where everyone and everything that needs to can access it. A big old workstation with lots of hard drives is better than hoping that people remember to do their backups.
If I can’t recover from a month ago, I don’t have a backup system. I have a replication system. Ask the questions on install. Get them to demo it.
If I can’t rebuild what I have now, quickly and with only minor disruption to the business, I don’t have a Disaster Recovery system. Get the timings quantified and hold them to it.
Anything sold as business continuity should be sold with an up time guarantee. That is “It will be up 99.9% of the time” ish. Do the maths and work it out.
95% = down for 438 hours a year
99% = down for 87.6 hours a year
99.9% = down for 8 and 3/4 hours a year
Five Nines or 99.999% = down for a little over 5 minutes a year
Determine if you are really a 24/7 business or if you are really a 9 to 5 with a few out-of-office needs.
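If you would rather not do the maths by hand, here is a small sketch that does it for any guarantee you are quoted. It assumes a plain 365 day year, and `downtime_hours_per_year` is a name made up for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(uptime_percent):
    """Hours per year a service may be down at a given up time guarantee."""
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


for pct in (95, 99, 99.9, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% = down for {hours:.2f} hours ({hours * 60:.1f} minutes) a year")
```

Five Nines works out at a little over 5 minutes of allowed down time a year.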
Are you seasonal? If you farm, you may need 99.999% up time during the harvest, and need only 9 to 5 support the rest of the time.
Ensure that you know the guaranteed up time on all the components collectively. It’s all very well to have a cloud provider guaranteeing Five nines reliability, but if your Internet provider will only give you a 95% guarantee, then your up time is actually a smidgin less than 95%. (The formula is the fractions of all components multiplied together)
ISP up time = 95%
Cloud provider = 99.999%
Total up time percentage = 0.95 * 0.99999 * 100 = 94.999%
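The same sum as a reusable sketch, for when you have more than two links in the chain. It assumes every component has to be working for you to be up and that they fail independently of each other, which is a simplification. `combined_uptime_percent` is a hypothetical name.

```python
from math import prod


def combined_uptime_percent(*component_percents):
    """Up time of a chain where every component must be working."""
    return prod(p / 100 for p in component_percents) * 100


isp = 95.0
cloud = 99.999
print(f"{combined_uptime_percent(isp, cloud):.3f}%")  # 94.999%
```

Add your router, your power supply and anything else in the path as extra arguments. The total can only ever go down.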
Some simple examples
“Begin at the beginning,” the King said gravely, “and go on till you come to the end: then stop.” — Lewis Carroll
Here are some basic scenarios that are designed to help people work out what their needs are and how to meet them. All scenarios are based on solutions that I have provided to customers at Champions of Change. I have changed the names in an effort to protect their identity.
Real Estate Kerblakistan
Real Estate Kerblakistan is a company employing 6 people in three separate business units. Most of their data is contracts and purchasing processing. They only operate during business hours and have limited need to share documents with each other.
They don’t have a server and don’t see their business growing any time soon, but they still need to ensure their data can be shared and is safe.
They determine their biggest risks are
- Cups of coffee being spilled on workstations
- Rogue employees doing evil stuff to their data. (Deletions or copying.)
Their solution is
- A cloud Customer Relation Manager with built in Document Versioning.
- Three Workstation backup drives that are rotated weekly and used to back up the Documents directory on the notebooks and workstations.
In case of emergency.
- Buy new laptops.
- Get internet connection
- Install basic software.
- Restore files from USB Hard drive.
Total time of emergency, 6 hours. Total time lost… up to 14 man hours.
Lizard Spock Creative
Lizard Spock Creative is a media company specializing in creating print ads for companies. They have Terabytes of fonts, photos, mood boards and publishing files. They have a smallish repository of accounts and HR data. They have two offices in separate countries.
Their biggest risks are:
- Losing graphic libraries used to build their products.
- Having a long distance between offices.
- Rogue employees.
- Running workstations that are not widely supported.
The solution is pretty much the same as above. Stick the backups in the cloud so both offices have visibility of the Admin data. Back up the NAS to USB for Libraries, current projects and Admin.
Ensure that all data is saved to the server via policy and convenient links. It’s easier to back something up if you know where it is.
In case of emergency,
- Buy new workstations and network kit.
- Make backups available from the cloud until new hardware is in place.
- Buy new NAS and local network kit.
- Use USB to replace NAS.
- Use Cloud storage to do an incremental restore.
Total time of emergency, 2 days (30 workstations take some time to buy and set up). Total time lost… 3 days + 2 weeks of slow service.
Thanks to my boss for the time to write this doco. It is by no means a bible for Backup and restore, but I hope it serves as a gentle intro to the concepts to allow you to evaluate the many great products on the market.
If you have any questions on anything in the document, feel free to get in touch and we can help you design a backup solution that is right for you, your business, and your budget.