Backup plans for the paranoid academic
17 July 2019
I dread losing my work or my data. And I dread it happening to others too (especially those in my research group). So I thought I’d share a simple approach to backing things up that makes it (very) unlikely that you will lose your work or data.
Why 3 backups
I am risk averse. So here’s a risk-averse person’s approach to figuring out how many backups you need. Let’s assume that in any given geographical location there’s a ~5% chance of some backup-killing catastrophe occurring in a given year. I’m thinking things like a fire burning down my building at work, a huge storm drowning my office, Dropbox going bust or getting hacked, etc. Obviously 5% is conservatively high, but that’s the point. The calculation is simple - because catastrophes in separate geographical locations are (roughly) independent, just multiply the risks together to get the overall probability of losing your work in a given year. E.g. if all locations have a 5% risk per year, then having backups in two locations gets you down to 0.25% per year, and having backups in three locations gets you down to 0.0125% per year. So, if I have backups in three separate geographic locations (far enough apart that the same catastrophe won’t affect more than one of them), I would expect to have to wait 8000 years before losing all my data. 8000 years sounds good to me. That’s about 250 times longer than I expect to be a practicing scientist (assuming I’m not caught up in one of the catastrophes).
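If you want to play with the numbers, here is a minimal sketch of that arithmetic. The 5% figure is just the assumption above, not a measurement, and the “expected years” line is simply 1 divided by the yearly risk:

```python
# Sketch of the back-of-the-envelope calculation above.
p_catastrophe = 0.05  # assumed yearly chance of a backup-killing event at any one location

for n_locations in (1, 2, 3):
    # All locations must be hit in the same year for everything to be lost.
    p_lose_everything = p_catastrophe ** n_locations
    expected_years = 1 / p_lose_everything
    print(f"{n_locations} location(s): {p_lose_everything:.4%} per year, "
          f"~{expected_years:,.0f} years between total losses")
```

With three locations this prints 0.0125% per year and ~8,000 years, matching the figures above; you can swap in a more or less pessimistic per-location risk to see how quickly extra locations pay off.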
What 3 backups
I don’t think the particular backups you choose really matter, so long as they all run regularly and automatically, and they are in separate geographical locations. But here’s what I use:
1. Time Machine on a local drive in my office. It’s super convenient, so when I realise I’ve accidentally deleted the wrong file it takes me less than a minute to get it back. It’s also free except for the negligible cost of a hard drive, which has to be replaced every few years. When my laptop dies, I just plug the new one into Time Machine and typically lose absolutely none of my work.
2. Dropbox. All of my files, bar a few documents I don’t want to store on the cloud, are on Dropbox. This costs me $10.99/month for 2TB, which is larger than the hard drive on my laptop. Dropbox is automatic, super quick, and can be accessed from anywhere. It lets me sync my work seamlessly between my laptop and desktop, and also has a Time Machine-like feature. The only limitation is that Dropbox can’t store some of the really huge data files I produce for my research.
3. Crashplan. This is also about $10/month. It’s not as convenient as Dropbox for day-to-day working, but it has unlimited storage. I use Crashplan to back up everything I have on Dropbox, as well as all of my raw data (currently ~50TB). (Note: these big raw data files are sequencing data, which are also held by the sequencing centre for a while, so that’s the third backup for them.) Crashplan has one other benefit, which is that it backs up all my GitHub repos (GitHub doesn’t play nice with Dropbox). Like the other two backups I use, Crashplan is set-and-forget: it just watches for files to change on my desktop or on the big raw-data drives I have, and when something changes it starts backing it up. Finally, it has a Time Machine-like feature that goes back further than Time Machine or Dropbox. I’ve used it a couple of times when I realised I’d lost a file from years ago that I wanted back. (A quick way to sanity-check that backups like these are still ticking over is sketched below.)
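Set-and-forget is great right up until a drive dies silently. One optional way to catch that is to check how recently anything changed under each local backup destination. The sketch below does this for the two destinations that live on local disks; the paths are hypothetical placeholders, not part of the setup described above, and a cloud service like Crashplan reports its status through its own app instead:

```python
import os
import time
from pathlib import Path

# Hypothetical local backup locations -- replace with your own mount points.
BACKUP_LOCATIONS = {
    "Time Machine drive": Path("/Volumes/TimeMachineBackup"),
    "Dropbox folder": Path.home() / "Dropbox",
}
MAX_AGE_DAYS = 2  # warn if nothing in a location has changed for this long

def newest_mtime(root: Path) -> float:
    """Return the most recent modification time of any file under root."""
    newest = root.stat().st_mtime
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                newest = max(newest, (Path(dirpath) / name).stat().st_mtime)
            except OSError:
                continue  # skip files that vanish or can't be read mid-walk
    return newest

for label, root in BACKUP_LOCATIONS.items():
    if not root.exists():
        print(f"WARNING: {label} not found at {root} -- is the drive plugged in?")
        continue
    age_days = (time.time() - newest_mtime(root)) / 86400
    status = "OK" if age_days <= MAX_AGE_DAYS else "STALE"
    print(f"{label}: last change {age_days:.1f} days ago [{status}]")
```

Run occasionally (or from a scheduled job), it turns “I assume my backups are fine” into something you can actually see.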
Value, peace of mind, and open data
Three independent backups might seem like overkill, and it probably is. But for the price of about a coffee a week I know that all of my work and data are safe, and I never have to worry. For me, it’s worth it. I like sleep.
I stop backing up data once they’re publicly archived. As a lab we make all of our data publicly available as soon as we can, which is often months or years before we publish on it. Once data are uploaded to e.g. NCBI or FigShare (or anywhere else that issues a DOI), I no longer back them up religiously. We’ll keep copies on servers as long as we’re still using them for analysis, but as long as the data are properly archived, I figure my own backups are redundant. This approach is useful in lots of ways - it means that we get the data publicly archived sooner, and that we do everything we can to make sure the archived copy of the data and/or code contains all the information: all the metadata, all the analysis code, anything at all associated with the original data. And it means that when we want to build on our own data or analyses, we start by downloading them from the public databases just like anyone else would. Over time, this helps us improve how we archive things - it’s amazing the simple things you can overlook that cause headaches later.