Thoughts on data safety
Data safety is very important but often neglected. How important it really is to keep one's data safe is often only recognized once disaster has already struck. This article covers a few ideas on how to prevent data loss even if your hard drive crashes or your house burns down.
Redundancy and why RAID is not a backup
First, let's consider securing our data against disk failure. For this purpose we can use an array of two or more disks (RAID) to mirror our data several times. I won't explain how RAID works in general, that's been done a thousand times before, but let me point out some stumbling blocks.
The simplest form of RAID is RAID-1 (okay, there's also RAID-0, but that doesn't provide any redundancy, so it doesn't apply here), which simply mirrors disk contents between two or more disks. RAID-1 is very primitive but fairly effective, since there's no big risk of data loss due to disk damage. But be aware that RAID-1 loses its strength while rebuilding the array: if one disk out of two fails, you have no redundancy left until the array is rebuilt entirely. If something bad happens during this time, you lose everything. Therefore you should consider using at least three disks for RAID-1; that way your data is still recoverable even if two disks fail.
The deceptive safety of RAID-5
One big disadvantage of RAID-1 is its low capacity: because you only mirror your disks, you waste a lot of space. Regardless of how many drives you put into the array, its usable size remains that of a single disk. RAID-5 is a compromise between redundancy and capacity. It stripes data across three or more disks to increase performance, as RAID-0 does, but additionally stores parity information so that the array survives the failure of any single disk.
Today RAID-5 is commonly used, but you should know about one great danger. RAID-5 is built to reconstruct data if exactly one disk fails, but it doesn't care about bad sectors on your remaining drives. When one of your disks crashes and you insert a new one, the array is rebuilt. If during that rebuild a sector on one of the other disks turns out to be unreadable, the rebuild terminates with an error and your data is lost. Modern SATA disks have an URE (unrecoverable read error) rate of one error per 10¹⁴ bits read. That means that statistically such an error occurs every 100,000,000,000,000 bits, which is about 12 TB. If you have a RAID-5 array with an overall capacity of more than 12 TB (e.g. 7 disks at 2 TB), it's very likely that one of your drives will produce a read error that causes the rebuild to fail. More expensive server disks and most SCSI devices have a better URE rate of one error per 10¹⁵ bits, which is about 115 TB.
An alternative to RAID-5 is RAID-6, which protects you from the failure of two drives, but with the increasing capacity of SATA disks even RAID-6 will become unreliable in a few years. Instead of RAID-6 you can use RAID-10, which is RAID-0 on top of several RAID-1 arrays. At least if you have more than just two disks in each RAID-1 array, this setup is relatively safe and reliable.
RAID is a good way to keep your data safe from disk failure, but please never forget that this doesn't mean you can't lose data. RAID only protects you from hardware damage, not from viruses, bad guys, file corruption or accidental deletions: all changes made to your data are mirrored to all other disks in the array immediately. Therefore RAID does not replace backups, but it is a good way to reduce the risk of data loss due to hardware failure.
Linux and RAID
Linux supports the creation of disk arrays natively, but you can also use hardware RAID. Whether you decide in favor of hardware or software RAID depends on your needs. Linux software RAID is cheap and very fast, but for some more advanced array types you might want to use a separate controller to save CPU cycles. In that case you should not choose the cheapest controller you can get hold of; these are often slower than software RAID. Also, your motherboard's onboard RAID “controller” is usually not a real controller but fake RAID implemented in the driver (and those drivers often exist for Windows only). For normal home use you won't feel a difference between hardware and software RAID-1; only RAID-6 could have a noticeable impact on your system's performance, depending on how many disks you have.
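For example, creating a software RAID-1 across three disks with mdadm might look like this (the device names are placeholders, adjust them to your system):

mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
cat /proc/mdstat   # watch the initial sync and later rebuilds here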
Strictly speaking, Linux's md driver does ship a native raid10 personality nowadays, but you can also build RAID-10 yourself: it is nothing other than striping (e.g. with LVM) on top of RAID-1 arrays, and Btrfs or ZFS could be used instead of LVM. Yes, really! ZFS has a native Linux port now, not only a FUSE implementation but a real kernel module. See zfsonlinux.org for more details. At this time LVM might be the recommended way for Linux, but once Btrfs or ZFS are stable enough for production use you should consider switching to them if you don't have a good reason for sticking with the more complicated LVM.
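If you want to try md's native raid10 personality instead, a minimal sketch (again with placeholder device names) would be:

mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1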
Backups
As mentioned above you still need to back up all your files even if you use RAID. There are hundreds of thousands of ways to get this done but let me quickly outline a few very powerful tools.
Good old-fashioned dump/restore
dump and restore are two of the oldest and most proven backup utilities for Linux. Originally invented for backing up to tape, they can also be used for backing up to ordinary files. dump is able to create complete as well as incremental backups of your Ext2/3/4 file system (similar tools exist for some other file system types). For this it provides 10 backup levels (0-9), where level 0 is a full backup and all others are incremental; current versions of dump also support levels above 9. Each level of 1 or above only backs up what has changed since the last backup at a lower level. For instance, a level 3 dump backs up all changes since the most recent backup at level 2, 1 or 0. Note that dump can only back up a whole file system incrementally; single files and directories can only be dumped at level 0.
To back up a file system you first have to unmount or remount it read-only.
mount -o remount,ro /dev/sdg1
dump -0uf /my/backup/file /dev/sdg1
This remounts /dev/sdg1 read-only and writes a level 0 (full) backup to /my/backup/file. The parameter -u updates /etc/dumpdates, which is important for the following incremental backups; without this parameter you wouldn't be able to do incremental backups at all.
Backup levels were originally invented to reduce the amount of tape needed by saving as little data as possible. Today hard drives are cheap, but you can still use these levels to implement several backup cycles. For instance you can do a full backup (level 0) each month, a level 1 backup each Monday and a level 2 backup every day.
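Such a cycle could look roughly like this (paths and schedule are only an example to illustrate the levels):

dump -0uf /backups/monthly.dump /dev/sdg1   # first day of the month: full backup
dump -1uf /backups/weekly.dump /dev/sdg1    # every Monday: everything changed since the level 0
dump -2uf /backups/daily.dump /dev/sdg1     # every day: everything changed since the last level 1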
To restore a backup use
restore -if /my/backup/file
That will open an interactive restore shell for you. Type help for usage instructions and exit with q.
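If you'd rather restore a complete dump non-interactively, restore's -r flag rebuilds the whole file system into the current working directory, e.g.:

cd /mnt/restored
restore -rf /my/backup/file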
dump over SSH
Never keep backups on the same drive, in the same computer, or even in the same room or house as the original data. It's always a good idea to store very important data in another place on earth, be it a friend's computer or a rented server somewhere on the globe.
To move your backups securely over the network to another machine, use SSH. Make sure an SSH server is running on the remote computer and that it's configured for key authentication. Then mount the remote backup drive via SFTP and copy your backup file over. To ensure your file was written correctly, compare the output of
ssh user@host "md5sum /backup/file"
with the MD5 sum of your local file.
Since you've set up key authentication for SSH, you can easily create and transfer your backups automatically via cron.
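Alternatively you can pipe dump's output straight through SSH instead of mounting the remote drive; a minimal sketch (user, host and remote path are placeholders):

dump -0uf - /dev/sdg1 | tee /my/backup/file | ssh user@host "cat > /backup/file"
md5sum /my/backup/file              # compare this sum...
ssh user@host "md5sum /backup/file" # ...with this one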
Amanda
Amanda is a mighty yet handy tool based on dump/restore as well as some other tools like tar. It's a very sophisticated piece of software used by many administrators. While my tool of choice is rsync (see below), many people prefer Amanda, which is why it's mentioned here. It's definitely worth a look.
For those who like it transparent: rsync
My favorite tool is rsync. It does its job fairly well and produces very small backups. rsync generally creates incremental backups, but one great advantage is that they behave like full backups: once rsync has performed a full backup, only files that have changed are transferred, while all other files are just hardlinked from the previous backup. Therefore you always have full copies of all your files and folders, yet they're deduplicated and don't consume more disk space than needed. To create a backup of a folder run
rsync --one-file-system --archive --numeric-ids \
      --link-dest=/backups/old/ /source/directory/ /backups/new/
This backs up /source/directory/ to /backups/new/. Unchanged files are just hardlinked from /backups/old/ and don't take up any additional disk space. The parameter --archive tells rsync to copy recursively and to preserve permissions, ownership, timestamps and symlinks, while --numeric-ids ensures that file ownership is not messed up by the destination computer's user configuration. --one-file-system is optional and tells rsync not to include other file systems mounted below /source/directory/. There are also lots of other parameters to fine-tune rsync's behavior; just have a look at the man page.
The example above just creates a copy on the same computer (or on a mounted remote drive), but of course you can also specify a host name to push the backup to (or to pull the files from, saving the backup on the local machine). When the source or destination contains a colon (:) separating host name and path, a remote shell is used, which defaults to SSH (unless you specify another shell with -e or --rsh=CMD); you can also prepend rsync:// as protocol prefix to use an rsync server (which could be run via xinetd). Also note that there's no need to verify file integrity yourself, since rsync does this automatically for you. It's a very decent tool.
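A push to a remote machine over SSH could then look like this (host name and paths are placeholders; --link-dest is interpreted on the receiving side):

rsync --one-file-system --archive --numeric-ids \
      --link-dest=/backups/old/ /source/directory/ user@host:/backups/new/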
Snapshots with ZFS and Btrfs
Both ZFS and Btrfs support not only logical volume management but snapshots too. Linux's LVM also supports snapshots, but LVM is an additional layer, whereas ZFS and Btrfs support snapshots natively at the file system level, which is quite handy. In Btrfs, snapshots are handled as subvolumes. To create a snapshot in Btrfs run
btrfs subvolume snapshot /mnt/mybtrfs /mnt/mybtrfs/mysnapshot
Mount the subvolume with
mount -t btrfs -o subvol=mysnapshot /dev/sdXY /mnt/mysnapshot
ZFS and LVM handle this in similar ways, no big deal. The advantage of snapshots is that they're created within seconds, because they're just copy-on-write references, not real copies; only maintaining the differences to the current file system state takes additional disk space. When you no longer need a snapshot, you can delete it with btrfs subvolume delete <name>.
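For comparison, the ZFS counterparts look something like this (pool and dataset names are made up):

zfs snapshot tank/data@mysnapshot   # create a snapshot
zfs list -t snapshot                # list all snapshots
zfs destroy tank/data@mysnapshot    # delete it again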
By the way, snapshots are also the preferred way of backing up databases: flush and lock all tables, create a snapshot of the database files and unlock the tables again. That's much faster than dumping all your database contents to text files and much safer than just copying the files of a running database (you should never ever do that!).
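With MySQL and a Btrfs data directory, such a flush/snapshot/unlock cycle could look roughly like this sketch (it assumes /var/lib/mysql is a Btrfs subvolume; note that the lock must be held by the same client session that takes the snapshot):

mysql -u root -p <<'SQL'
FLUSH TABLES WITH READ LOCK;
-- take the snapshot from within the locked session
system btrfs subvolume snapshot /var/lib/mysql /var/lib/mysql-snapshot
UNLOCK TABLES;
SQL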
So how do you use this for backup? Easy: just copy the snapshot to your backup drive and then delete it (the snapshot, not the backup). But there's an even better method…
The ultimate coup: rsync + ZFS/Btrfs
That's absolutely ingenious! But hold on, this time we do it the other way round: we don't create snapshots and copy them over, we copy the files and version them with snapshots on the backup side. Snapshots don't just enable rsync to back up databases, they can also be used on the receiving end. Remember that rsync creates hardlinks for unchanged files? That's quite an effort and takes some time. What if we just created a snapshot of the previous backup and then copied the new files directly over the old ones? That's not only faster, it also speeds up the deletion of old backups. Isn't that great?
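A rough sketch of this scheme with Btrfs on the backup drive (paths are examples, and /backups/current is assumed to be a subvolume):

btrfs subvolume snapshot /backups/current "/backups/$(date +%F)"   # freeze the previous state
rsync --archive --numeric-ids --delete --inplace /source/directory/ /backups/current/

--delete removes files from the backup that no longer exist at the source, and --inplace makes rsync overwrite changed files directly instead of creating temporary copies; files that haven't changed aren't touched at all and keep sharing their blocks with the snapshots.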
Conclusion
Redundancy in combination with regularly performed backups is absolutely essential, and Linux has a lot of simple but powerful tools to accomplish this. One last thing: you don't have to keep all your backups until the cows come home, but you should always preserve the last few full (!) backups and their incremental additions, not just the very latest one. In case you messed up some files without noticing, you'll be glad to have an even older backup than just the last one. And yes, that raises a good question: how do I ensure that my backup doesn't contain corrupted files? The answer is: you can't, unless you want to browse all files manually to check them. You can checksum all your files, but that has no effect if the original file is already broken. Thus it's important to keep some older revisions of your files. How many of them you actually keep is completely up to your budget, your disk space and of course your preference. The more you have, the safer your data is, but the more hard drive space you need. Always ask yourself: “How important are my files? What would happen if they were lost?” That's how you assess risk.