Thursday, February 3, 2011

Automatic Offsite Backups

I run a website which currently hosts approximately 300GB (over 1000 files) of training videos on a shared hosting provider. We increase that by ~50 video files per month (~20GB). Currently our backups live on the desktop machines of our staff; however, I'd like to set up something more automated. I will be looking into other hosting options, but in the meantime, I would like opinions/improvements regarding the following plan for backups on this server.

There are two types of files that will be backed up. The first is the video files described above. These only need to be backed up once per file, as they will never change. The second type is the files of the site itself. These should be backed up regularly and tracked for revisions. Most of the changes here will not be coding changes, and the staff making the changes are 1) not technically inclined and 2) distributed throughout the US. I don't think that an svn-based solution will work well given these facts.

So here's what I am thinking:

  1. Create a DB table to log backups. This table will include: hash of the file, modification date, size, date of backup, local path (at the time of backup), and path to the remote version of the file.
  2. Use a script running on a cron job to regularly (daily? weekly? monthly?) navigate the directory structure to identify files that have not been backed up. This identification can be done by comparing the hashes.
  3. After identifying the files that need to be transferred, the script will ftp them to the remote server. After each file is transferred successfully, a record of that transfer will be inserted into the DB.
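
A rough sketch of the three steps above, not a finished tool. It assumes SQLite as the DB, SHA-256 as the hash, and a remote layout that stores each file under its hash; the table, column, and function names are all made up for illustration. The upload is passed in as a callable so the FTP part can be swapped out or tested without a server.

```python
import hashlib
import os
import sqlite3
import time
from ftplib import FTP


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large video is never read into memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_log(db_path):
    """Step 1: create the backup log table (illustrative schema) if missing."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS backup_log (
               hash TEXT, mtime REAL, size INTEGER,
               backed_up_at REAL, local_path TEXT, remote_path TEXT)"""
    )
    return conn


def pending_files(root, conn):
    """Step 2: walk the tree and yield files whose hash is not in the log."""
    for dirpath, _dirs, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(dirpath, name)
            digest = file_sha256(path)
            seen = conn.execute(
                "SELECT 1 FROM backup_log WHERE hash = ?", (digest,)
            ).fetchone()
            if seen is None:
                yield path, digest


def backup_new_files(root, conn, upload):
    """Step 3: upload each pending file, logging it only after success."""
    done = []
    for path, digest in pending_files(root, conn):
        remote_path = upload(path, digest)  # expected to raise on failure
        info = os.stat(path)
        conn.execute(
            "INSERT INTO backup_log VALUES (?, ?, ?, ?, ?, ?)",
            (digest, info.st_mtime, info.st_size, time.time(), path, remote_path),
        )
        conn.commit()  # commit per file so an interrupted run can resume
        done.append(path)
    return done


def make_ftp_upload(host, user, password, remote_dir):
    """Build an upload callable; storing by hash is an assumed remote layout."""
    def upload(path, digest):
        remote_path = f"{remote_dir}/{digest}"
        with FTP(host, user, password) as ftp, open(path, "rb") as fh:
            ftp.storbinary(f"STOR {remote_path}", fh)
        return remote_path
    return upload
```

Because a row is written only after its upload returns, a first run that dies partway through the 300GB can simply be restarted and will pick up where it left off. Hashing every file on every run is the expensive part; comparing the logged size and modification date before hashing would be one way to cut that down.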

Do you see any problems with this approach? Will I run into issues the first time the script executes, due to the large amount of data to be transferred during the first go-round?

  • Sounds like a decent idea to me, although I think you might be reinventing the wheel here a bit since I'm sure there is backup software out there that would cover your needs.

    As for backing up the site source code - wouldn't that be something better left to version control software?

    JGB146 : Marking this as accepted b/c I'm moving ahead with my plan since no other answers fit my situation (i.e. shared hosting environment w/ no su access + don't have S3 so acquiring it fits more into the realm of upgrading our hosting, I think)
  • Backup software that can exclude certain types of files exists (sorry I can't give you software names today, it's Bastille Day here and my colleagues aren't around :) ). It would let you back up the huge files (videos) separately from the common files.

    As for the DB table: I wouldn't rely on something that complicated in an emergency such as a disaster. I'd rely only on human-readable plain text files. You don't know how bad things will be, except that you have an offline backup hard disk from which you must save the world, your company and your a**. In that case, you can mount the HD and open a text file in a few seconds, whereas it would take a few minutes or longer to extract data from a DB table (if it isn't damaged), when you have better things to do and think about.
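
To make the plain-text idea concrete, here is a minimal sketch that appends one tab-separated line per backed-up file to a manifest that opens in any editor. The column order, the SHA-256 hash, and both function names are assumptions for illustration, not something prescribed above.

```python
import csv
import hashlib
import os
import time

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # written in UTC


def append_manifest_line(manifest_path, local_path, remote_path):
    """Append one human-readable, tab-separated record for a backed-up file."""
    digest = hashlib.sha256()
    with open(local_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    info = os.stat(local_path)
    row = [
        time.strftime(TIME_FORMAT, time.gmtime()),               # date of backup
        time.strftime(TIME_FORMAT, time.gmtime(info.st_mtime)),  # modification date
        str(info.st_size),
        digest.hexdigest(),
        local_path,
        remote_path,
    ]
    with open(manifest_path, "a", newline="", encoding="utf-8") as out:
        csv.writer(out, delimiter="\t", lineterminator="\n").writerow(row)
    return row


def find_in_manifest(manifest_path, needle):
    """Return rows whose local path contains `needle`."""
    with open(manifest_path, newline="", encoding="utf-8") as fh:
        return [row for row in csv.reader(fh, delimiter="\t") if needle in row[4]]
```

In an emergency the lookup function is optional: the file can be read directly or searched with whatever text tool is at hand.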

    Intervals of time: a daily diff and a complete backup once a week or twice a month seem reasonable and sufficient to me (I work for a web agency, not a bank). YMMV.

    We try to keep many copies of the same file in completely different places, while still knowing which files are the most recent. What would you do if the backup hard disk crashed along with the machine it was connected to? If you didn't have a second copy of this HD, you'd be in trouble. The houses of family or friends are great places to store encrypted disks, just in case. Then you have to manage passwords and the people who know them: parents, husband/wife, boss, best friend, etc.

    EDIT: isn't it a question for ServerFault.com?

    JGB146 : This might be better suited for SF. When I started writing it, I expected to go more on the content of the script performing the backup operations, and to create a separate SF question about the hosting. That wasn't how it turned out by the time I was done writing.
  • My personal solution to something similar is S3 and git.

    First, sync all the videos to S3. Note that this also provides some amount of backup to your website since you can serve the files straight from S3 as well, should the need arise.

    Second, put all the files 'from the site itself' into a git repo, and whenever you want to do a backup, do a commit and then put a copy of the .git dir on S3 as well. Note that no one but you has to know how to work git.

    This gets you a simple duplicate backup of the videos and a more complex timeline-based backup of the site. And of course, though I use S3, you could equally well use Dropbox or a remote host or whatever.
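
A minimal sketch of the site half of that workflow, assuming git is installed, the site directory is already a repository, and a committer identity is configured. The function names and the tarball naming scheme are invented for illustration, and the upload itself is left to whichever S3 client (or other remote) is in use.

```python
import os
import subprocess
import tarfile
import time


def commit_site(repo_dir, message=None):
    """Stage everything and commit; return False when nothing has changed."""
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    if not status.stdout.strip():
        return False  # clean tree, so there is nothing to commit
    message = message or time.strftime("Backup %Y-%m-%d %H:%M:%S")
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)
    return True


def archive_git_dir(repo_dir, out_dir):
    """Pack the repo's .git directory into a dated tarball ready for upload."""
    git_dir = os.path.join(repo_dir, ".git")
    if not os.path.isdir(git_dir):
        raise FileNotFoundError(f"no .git directory in {repo_dir}")
    os.makedirs(out_dir, exist_ok=True)
    name = time.strftime("site-git-%Y%m%d-%H%M%S.tar.gz")  # assumed naming scheme
    archive_path = os.path.join(out_dir, name)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(git_dir, arcname=".git")
    return archive_path
```

A cron job could call `commit_site` and then `archive_git_dir`, so the staff editing the site never touch git themselves.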

    From pjz
  • Your application sounds common enough that I wouldn't recommend investing the time in rolling your own solution.

    Something like rsnapshot could take care of your versioning needs (provided the destination machine has enough disk space, of course) without reinventing the wheel, as you are doing with your "backup database". You'd need to use the rsync protocol rather than FTP, but you'll more than likely end up with less data traversing the wire using rsync anyway.

    If you want to be a bit more bleeding-edge, you might give FSVS (Fast System VerSioning) a look. It's a backup system that uses the Subversion back-end to store files and track versions, but doesn't require end users to interact with Subversion.

    JGB146 : I think the shared-hosting provides some problems here. From my scans of the rsnapshot docs, I'm concluding that rsnapshot requires su. Am I wrong?
  • I have one word for you my friend: rsnapshot

    It does everything you listed above with the added bonus of you not having to write a single line of code. It only backs up changed files, so after your initial huge backup it will only back up new/changed files. It runs very fast and is easy to get up and running.

    JGB146 : I think the shared-hosting provides some problems here. From my scans of the rsnapshot docs, I'm concluding that rsnapshot requires su. Am I wrong?
    Josh Budde : It does not require su as long as the user you're connecting as has appropriate rights to read all the files you want to back up.
    JGB146 : http://rsnapshot.org/howto/1.2/rsnapshot-HOWTO.en.html states in each installation instance that I can find that one of the steps is to `su` and then execute a `make install`. Is there another option that I'm not seeing?
    Josh Budde : Well, the way I would handle it is to run rsnapshot on a machine at your office and 'pull' the backups. That's how I handle it anyway. But the make install step is only for convenience: it takes the unpacked files and copies them to default locations. It's not necessary.
    From Josh Budde
