Thoughts on data safety


Data safety is very important but often neglected. How important it really is to keep one's data safe is often only recognized once disaster has already struck. This article covers a few ideas on how to prevent data loss even if your hard drive crashes or your house catches fire.

Redundancy and why RAID is not a backup

First, let's consider protecting our data against disk failure. For this purpose we can use an array of two or more disks (RAID) to mirror our data several times. I won't explain how RAID works or how to set it up (that's been done a thousand times before), but let me point out a few stumbling blocks.

The simplest form of RAID is RAID-1 (okay, there's also RAID-0, but that doesn't provide any redundancy, so it doesn't count here), which simply mirrors disk contents between two or more disks. RAID-1 is very primitive but fairly effective, since there's no big risk of data loss due to disk damage. But be aware that RAID-1 loses its strength while the array is being rebuilt. If one disk out of two fails, you have no redundancy left until the array has been rebuilt entirely. If something bad happens during this time, you lose everything. Therefore you should consider using at least three disks for RAID-1; even if two disks fail, your data is still recoverable.

The deceptive safety of RAID-5

One big disadvantage of RAID-1 is its low capacity. Because all disks are mirrors of each other, you waste a lot of space: no matter how many drives you add to the array, its usable size remains that of a single disk. RAID-5 is a compromise between redundancy and capacity. It stripes data across three or more disks to increase performance, as RAID-0 does, but also stores parity information so the array survives the failure of any one disk.

Today RAID-5 is commonly used, but you should know about one great danger. RAID-5 is built to reconstruct data if exactly one disk fails, but it cannot cope with bad sectors on the remaining drives. When one of your disks crashes and you insert a new one, the array is rebuilt. If during that rebuild a single sector on one of the other disks turns out to be unreadable, the rebuild aborts with an error and your data is lost. Modern SATA disks have an URE (unrecoverable read error) rate of one error per 10¹⁴ bits read. That means that statistically such an error occurs every 100,000,000,000,000 bits, which is about 12TB. If you have a RAID-5 array with an overall capacity of more than 12TB (e.g. 7 disks at 2TB each), it's very likely that one of your drives will produce such an error and cause the rebuild to fail. More expensive server disks and most SCSI devices have a better URE rate of one per 10¹⁵ bits, which is about 115TB.
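
If you want to check that figure yourself, the conversion is simple enough to do on the command line (using the URE rate quoted above, one error per 10¹⁴ bits read):

echo '10^14 / 8 / 10^12' | bc -l    # bits -> bytes -> TB, prints about 12.5

So after reading roughly 12TB you have to expect one unrecoverable error, and reading the whole array is exactly what a rebuild does.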

An alternative to RAID-5 is RAID-6, which protects you from the failure of two drives, but with the ever increasing capacity of SATA disks, RAID-6 will become unreliable in a few years as well. Instead of RAID-6 you can use RAID-10, which is RAID-0 on top of several RAID-1 arrays. At least if you have more than just two disks in each RAID-1 array, this method is relatively safe and reliable.

RAID is a good way to keep your data safe from disk failure, but please never forget that this doesn't mean you can't lose data. RAID only protects you from hardware damage, not from viruses, bad guys, file corruption or accidental deletions. All changes made to your data are mirrored to all other disks in the array immediately. RAID therefore does not replace backups; it's just a good way to reduce the risk of data loss due to hardware failure.

Linux and RAID

Linux supports the creation of disk arrays natively, but you can also use hardware RAID. Whether you decide in favor of hardware or software RAID depends on your needs. Linux software RAID is cheap and very fast, but for some of the more advanced array types you might want to use a separate controller to save CPU cycles. In that case you should not pick the cheapest controller you can get hold of; those are often slower than software RAID. Also, your motherboard's onboard RAID “controller” is not a real controller but so-called fake RAID (and often only works with Windows). For normal home use you won't feel a difference between hardware and software RAID(-1). Only RAID-6 could have a noticeable impact on your system's performance, depending on the number of disks you have.
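
For illustration, setting up a three-disk software RAID-1 with mdadm looks roughly like this (the device names are only examples, adapt them to your system):

mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
cat /proc/mdstat       # watch the initial sync progress
mkfs.ext4 /dev/md0     # then create a file system on the array

The array appears as a single block device (/dev/md0 here) which you can format and mount like any ordinary partition.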

Linux RAID does not natively support RAID-10, but that doesn't mean you can't use it. RAID-10 is nothing more than LVM (striping) on top of RAID-1, and Btrfs or ZFS could be used instead of LVM. Yes, really! ZFS now has a native Linux port, not just a FUSE implementation but a real kernel module; see zfsonlinux.org for more details. At the moment LVM may be the recommended way on Linux, but once Btrfs or ZFS are stable enough for production use, you should consider switching to them if you don't have a good reason to stick with the more complicated LVM.
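
As a rough sketch (device and volume group names are made up), the “LVM on RAID-1” approach could look like this:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
pvcreate /dev/md0 /dev/md1
vgcreate vg_raid10 /dev/md0 /dev/md1
lvcreate -i 2 -l 100%FREE -n lv_data vg_raid10   # -i 2 stripes across both mirrors
mkfs.ext4 /dev/vg_raid10/lv_data

The two mirrors provide the redundancy and the striped logical volume on top provides the RAID-0 part.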

Backups

As mentioned above, you still need to back up all your files even if you use RAID. There are countless ways to get this done, but let me quickly outline a few very powerful tools.

Good old-fashioned dump/restore

dump and restore are two of the oldest and most proven backup utilities for Linux. Originally written for backing up to tape, they can also back up to ordinary files. dump can create both full and incremental backups of your Ext2/3/4 file system (similar tools exist for some other file system types). For this it provides 10 backup levels (0-9), where level 0 is a full backup and all others are incremental; current versions of dump also support levels above 9. A backup at level 1 or above only saves what has changed since the last backup at any lower level. For instance, level 3 backs up all changes since the most recent backup at level 2, 1 or 0. Note that dump can only back up a whole file system incrementally; single directories or files can only be backed up at level 0.

To back up a file system you first have to unmount it or remount it read-only.

mount -o remount,ro /dev/sdg1
dump -0uf /my/backup/file /dev/sdg1

This remounts /dev/sdg1 read-only and writes a level 0 (full) backup to /my/backup/file. The parameter -u updates /etc/dumpdates, which is important for subsequent incremental backups; without it you wouldn't be able to do incremental backups at all.

Backup levels were originally invented to reduce the amount of tape needed by saving as little data as possible. Today hard drives are cheap, but you can still use these levels to implement backup cycles. For instance, you could do a full backup (level 0) on the first of each month, a level 1 backup every Monday and a level 2 backup every day, as sketched below.
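
One possible way to schedule such a cycle is via /etc/crontab. This is only a sketch: the paths and times are examples, and the file system should of course be remounted read-only as shown above before each run.

# full backup (level 0) on the first day of every month
0 3 1 * *  root  dump -0uf /backups/monthly.dump /dev/sdg1
# level 1 every Monday: everything changed since the last level 0
0 3 * * 1  root  dump -1uf /backups/weekly.dump /dev/sdg1
# level 2 every night: everything changed since the last lower-level backup
0 3 * * *  root  dump -2uf /backups/daily.dump /dev/sdg1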

To restore a backup use

restore -if /my/backup/file

This opens an interactive restore shell. Type help for usage instructions and q to exit.
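
restore can also be used non-interactively, for example to list a dump's contents or to extract single files (the file names are just examples):

restore -tf /my/backup/file              # list the contents of the dump
restore -xf /my/backup/file ./etc/fstab  # extract a single file into the current directory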

dump over SSH

Never keep backups on the same drive, in the same computer, or even in the same room or building as the original data. It's always a good idea to store very important data somewhere else entirely, be it on a friend's computer or on a rented server somewhere on the globe.

To move your backups securely over the network to another machine, use SSH. Make sure an SSH server is running on the remote computer and that it's configured for key-based authentication. Then mount the remote backup drive via SFTP and copy your backup file over. To ensure the file was written correctly, compare the output of

ssh user@host "md5sum /backup/file"

with the MD5 sum of your local file.
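
Alternatively, dump can write the backup to standard output, which lets you stream it straight to the remote machine without an intermediate local file (host name and paths are examples):

dump -0uf - /dev/sdg1 | ssh user@host "cat > /backup/file"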

Since you've set up key-based authentication for SSH, you can easily create and transfer your backups automatically via cron.
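
A small script along these lines could be run from cron; it's only a sketch with made-up paths, so adapt it before use:

#!/bin/sh
set -e
mount -o remount,ro /dev/sdg1
dump -0uf /tmp/backup.dump /dev/sdg1
mount -o remount,rw /dev/sdg1
scp /tmp/backup.dump user@host:/backup/file
# compare checksums to make sure the transfer was not corrupted
[ "$(md5sum < /tmp/backup.dump)" = "$(ssh user@host 'md5sum < /backup/file')" ]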

Amanda

Amanda is a mighty yet handy tool based on dump/restore as well as other utilities such as tar. It's a very sophisticated piece of software used by many administrators. My own tool of choice is rsync (see below), but many people prefer Amanda, which is why it's mentioned here. It's definitely worth a look.

For those who like it transparent: rsync

My favorite tool is rsync. It does its job very well and produces very small backups. rsync generally creates incremental backups, but one great advantage is that they behave like full backups: once rsync has performed a full backup, only files that have changed are copied again; all other files are simply hardlinked to the previous backup. Therefore you always have full copies of all your files and folders, but they're deduplicated and don't consume more disk space than necessary. To create a backup of a folder, run

rsync --one-file-system --archive --numeric-ids \
    --link-dest=/backups/old/ /source/directory/ /backups/new/

This backs up /source/directory/ to /backups/new/. Unchanged files are just hardlinked from /backups/old/ and don't take up any additional disk space. --archive preserves permissions, ownership, timestamps and symlinks and recurses into directories, and --numeric-ids ensures that file ownership is not remapped to the destination computer's user configuration. --one-file-system is optional and tells rsync not to descend into other file systems mounted under /source/directory/. There are lots of other parameters to fine-tune rsync's behavior; just have a look at the man page.

The example above just creates a copy on the same computer (or on a mounted remote drive), but of course you can also specify a host name to push the backup to (or to pull files from, if you want to store the backup on the local machine). When the destination contains a colon (:) separating host name and path, rsync uses a remote shell, which is SSH unless you specify another shell with -e or --rsh=CMD. You can also prepend rsync:// as a protocol prefix to use an rsync server (which could be run via xinetd). Also note that there's no need to verify file integrity yourself, rsync does that automatically. It's a very decent tool.
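
For example, the same backup could be pushed to a remote machine over SSH, or files could be pulled from an rsync daemon (host names, module and paths are made up):

rsync --archive --numeric-ids --link-dest=/backups/old/ \
    /source/directory/ user@host:/backups/new/
rsync --archive --numeric-ids rsync://host/module/some/dir/ /local/restore/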

Snapshots with ZFS and Btrfs

Both ZFS and Btrfs support not only logical volume management but also snapshots. Linux's LVM supports snapshots too, but LVM is an additional layer, whereas ZFS and Btrfs support this natively at the file system level, which is quite handy. Snapshots are handled as subvolumes. To create a snapshot in Btrfs, run

btrfs subvolume snapshot /mnt/mybtrfs /mnt/mybtrfs/mysnapshot 

Mount the subvolume with

mount -t btrfs -o subvol=mysnapshot /dev/sdXY /mnt/mysnapshot

ZFS and LVM handle this in similar ways, no big deal. The advantage of snapshots is that they're created within seconds, because they are copy-on-write references rather than real copies; only the differences to the current file system state take up additional disk space. You can also delete subvolumes/snapshots again with btrfs subvolume delete <name>.
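
For comparison, the ZFS equivalents look like this (pool and file system names are examples):

zfs snapshot tank/data@mysnapshot    # create a snapshot
zfs list -t snapshot                 # list existing snapshots
zfs rollback tank/data@mysnapshot    # roll the file system back to it
zfs destroy tank/data@mysnapshot     # delete the snapshot again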

By the way: snapshots are also the preferred way of backing up databases. Flush and lock all tables, create a snapshot of the database files and unlock the tables again. That's much faster than dumping all your database contents to text files and much safer than just copying the files while the database is running (you should never ever do that!).
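
A rough sketch of that procedure for MySQL, assuming the data directory lives on a Btrfs volume and that the mysql client's system command is available in this context (paths are examples):

mysql <<'EOF'
FLUSH TABLES WITH READ LOCK;
system btrfs subvolume snapshot /var/lib/mysql /var/lib/mysql-snapshot
UNLOCK TABLES;
EOF

The snapshot is taken while the lock is held within the same client session, so the tables are only locked for a moment.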

So how do you use this for backups? Easy: just copy the snapshot to your backup drive and then delete it (the snapshot, not the backup ;-)). But there's an even better method…

The ultimate coup: rsync + ZFS/Btrfs

This one is absolutely ingenious, but hold on, we do it the other way round: instead of creating snapshots and copying them somewhere, we copy the files and version them with snapshots. Btrfs and ZFS enable rsync to back up databases (via snapshots on the source side), but they can just as well be used on the backup side. Remember that rsync creates hardlinks for unchanged files? That's a lot of effort and takes some time. What if we instead kept a snapshot of the previous backup and let rsync copy only the new and changed files directly over the old ones? That's not only faster, it also speeds up deleting old backups. Isn't that great? A minimal sketch follows below.
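
Here is that minimal sketch, assuming /backups is a Btrfs file system containing a subvolume named current (all paths are examples):

rsync --archive --numeric-ids --delete \
    /source/directory/ /backups/current/
btrfs subvolume snapshot -r /backups/current "/backups/$(date +%F)"

Each run updates current in place and then freezes its state as a read-only, dated snapshot; outdated backups are removed in an instant with btrfs subvolume delete.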

Conclusion

Redundancy in combination with regular backups is absolutely essential, and Linux has a lot of simple but powerful tools to accomplish this. One last thing: you don't have to keep all your backups until the cows come home, but you should always preserve the last few full (!) backups together with their incremental additions, not just the very latest one. In case you messed up some files without noticing, you'll be glad to have an even older backup than just the last one.

And yes, that raises a good question: how do I ensure that my backup does not contain corrupted files? The answer is: you can't, unless you want to check every file manually. You can checksum all your files, but that doesn't help if the original file is already broken. Thus it's important to keep some older revisions of your files. How many of them you keep is completely up to your budget, your disk space and of course your preference. The more you keep, the safer your data is, but the more disk space you need. Always ask yourself: “How important are my files? What would happen if they were lost?” That's how you assess the risk.


Comments

Basti wrote:
Nice article (as always). Just a few quick thoughts: be careful with rsync or you risk data loss (do a dry run with -n first); other nice backup tools are backintime (really mighty tool) and mintbackup. I run my backups over here with backintime towards my NAS. Nice RAID explanation, learned a lot. Keep on blogging :)
Janek Bevendorff wrote:
Hi, thanks for your comment. All the others were just spam caught by TypePad. :D But these are indeed good additions. Dry run is always recommended.
Andrwe wrote:
Hi, nice article and a good explanation of the advantages and disadvantages. I just want to add that the described methods aren't really suited for backing up the system itself, just in case someone out there doesn't think it all through and tries to use one of them to back up his /. On this page http://www.seiichiro0185.org/linux:backupscripts you can find some scripts for backing up your home (using the rsync method mentioned above) as well as your system. I'm using both and they've already proven themselves. So have fun and back up your important data. :D

