Shielding your files with Reed-Solomon codes

 

 

     
(July 2008)

(Based on an idea I had a decade ago)

Changelog:

August 3, 2008: Slashdotted!
August 4, 2008: Made a plain-C package available, to support 64-bit OSes (as well as OS X and Cygwin users).

Shield my files? Why?

You know why!

Have you never lost a file because of storage media failure? That is, had a hard drive or USB stick lose a bunch of sectors (bad sectors) that just happened to be the ones hosting parts (or all) of your file?

I have. More than once... :-)

The way storage media quality has been going in recent years, it is bound to happen to you, too. When it does, believe me, you'll start to think seriously about ways to protect your data. And you'll realize that there's quite a lot of technology to choose from...

  • "Backup, backup, backup! And take it with you!" This is a valid and wise suggestion, but it doesn't address the details of backing up... There's more than one way to back up; my personal favorite has a lot of features you'll probably like. Unfortunately, backups themselves are also stored in some kind of storage, so the question is: how can you be certain that your backup storage won't fail? And equally important, how often will you back up? If you do it once per week, you might end up losing a week's worth of work - is that acceptable in your line of work?
  • Others will advocate RAID. Use a RAID scheme across more than one disk, and when one fails, the machine will keep on working with the rest - at least in theory. In practice, faulty RAID controllers (especially the on-board garden variety) can wreak havoc just as much as faulty storage media can. If you decide to go for RAID, I suggest you use your OS's support for software RAID (e.g. Linux md), and I also suggest using the simplest possible building blocks: RAID1 (mirrors), or if you really need speed, RAID10 (stripes of mirrors).
  • Then again, neither backup nor RAID would save you from accidental deletions or file corruptions. Today's word processors (thinking of Mr. Clippy, not LaTeX) are such complex beasts that their crashing is considered a normal everyday activity (which is why they were "enhanced" years ago with periodic auto-saves). After one of these crashes, chances are you'll find your document corrupted. A solution, you ask? Simple: use version control... Subversion and Git are wonders of the world - the former even has nice GUIs for non-technical folk. You can then recover from deletions and corruptions, since your repository will provide the file again.
       Then again, you may be forced to work in "lone wolf" mode... Working with your laptop in airport lounges and dark, secluded caves (known as "hotels"). Access to the web may be missing or firewalled, and therefore there may be no way to hook your laptop to your company's repository...
  • Burning to re-writable DVDs? Chances are that when disaster strikes, you will find your precious backup DVD is scratched... Or that your USB stick in your keychain didn't survive the constant scratching from your keys...
My point?

There's no such thing as "enough protection" for your data - the more you have, the better the chances that your data will survive disasters.

What follows is a simple description of a way I use to additionally "shield" my important files (a daily "tar" I take every night through a cron job), so that even if some sectors hosting them are lost, I still end up salvaging everything.

Algorithm

The idea behind this process is error correcting codes, like for example the ubiquitous Reed-Solomon. With Reed-Solomon, parity bytes are used to protect a block of data from a specified maximum number of errors per block. In the tools described below, a block of 223 bytes is shielded with 32 bytes of parity. The original 223 bytes are then morphed into 255 "shielded" ones, and can be recovered even if 16 bytes from inside the "shielded" block turn to noise...
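The figures in the paragraph above follow from one general property of Reed-Solomon codes: when the positions of the damaged bytes are not known in advance, each corrected byte costs two parity bytes. The short Python sketch below only redoes that arithmetic; it is not part of rsbep, and the function name `rs_summary` is made up for this illustration.

```python
# Parameters quoted in the text: 223 data bytes + 32 parity bytes per block.
DATA_BYTES = 223
PARITY_BYTES = 32


def rs_summary(data_bytes: int, parity_bytes: int) -> dict:
    """Basic figures for a Reed-Solomon code that works on bytes.

    Hypothetical helper, written for illustration only.
    """
    block_bytes = data_bytes + parity_bytes
    return {
        "block_bytes": block_bytes,
        # With unknown error positions, two parity bytes buy one corrected byte.
        "correctable_errors": parity_bytes // 2,
        # How much bigger the shielded stream is than the raw data.
        "overhead": block_bytes / data_bytes,
    }


summary = rs_summary(DATA_BYTES, PARITY_BYTES)
print(summary)
```

So the price of surviving 16 damaged bytes per block is a shielded file roughly 14% larger than the original.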

Storage media are of course block devices that work or fail on 512-byte sector boundaries (for hard disks and floppies, at least; on CDs and DVDs the sector size is 2048 bytes). This is why the shielded stream must be interleaved every N bytes (that is, the encoded bytes of each block must be placed in the shielded file at offsets 1, 1+N, 1+2N, ..., then 2, 2+N, etc.). In this way, 512 shielded blocks pass through each sector (for 512-byte sectors), and if a sector becomes defective, only one byte is lost in each of the shielded 255-byte blocks that pass through this sector. The algorithm can handle 16 of those errors per block, so data will only be lost if more than 16 of the sectors that a given block passes through are lost. Taking into account the fact that sector errors are local events (in terms of storage space), chances are quite high that the file will be completely recovered, even if a large number of sectors (in this implementation: up to 127 consecutive ones) are lost.
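The interleaving argument can be checked with a small simulation. The Python sketch below is not rsbep's code, and it rests on two assumptions: the interleave distance N is 4080 bytes (the default this page quotes later for parameter R), and byte j of block i is stored at 0-based offset j*N + i. The name `worst_hits` is hypothetical. It counts how many damaged bytes land in the unluckiest block when a run of consecutive bytes is wiped out.

```python
SECTOR = 512       # bytes per sector on hard disks and floppies
BLOCK = 255        # shielded bytes per Reed-Solomon block
MAX_ERRORS = 16    # damaged bytes one block can absorb
STRIDE = 4080      # assumed interleave distance N (16 * 255)


def worst_hits(burst_start: int, burst_len: int, stride: int = STRIDE) -> int:
    """Largest number of damaged bytes that fall into any single block.

    Assumed layout: byte j of block i lives at offset j * stride + i, so one
    frame of `stride` interleaved blocks occupies BLOCK * stride bytes.
    Only damage inside the first frame is counted.
    """
    frame = BLOCK * stride
    hits = [0] * stride
    for offset in range(burst_start, burst_start + burst_len):
        if offset < frame:
            hits[offset % stride] += 1   # offset % stride is the block index i
    return max(hits)


# 127 consecutive dead sectors: every block stays within its error budget.
print(worst_hits(0, 127 * SECTOR) <= MAX_ERRORS)
# 128 consecutive dead sectors: at least one block is hit too many times.
print(worst_hits(0, 128 * SECTOR) <= MAX_ERRORS)
```

Under these assumptions a burst of 127 sectors costs each block at most 16 bytes, while one more sector pushes some blocks to 17, which is beyond what the 32 parity bytes can repair.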

I implemented this scheme back in 2000 for my diskettes (remember them?). Recently, I discovered that Debian comes with a similar utility called rsbep, which after a few modifications is perfect for providing adequate shielding to your files.

Download

Here is the source code for my customization of rsbep, a utility that implements the kind of Reed-Solomon-based "shielding" that we talked about. The package includes x86 assembly that makes it an order of magnitude faster than plain C; if, however, you are not on a 32-bit x86 platform, you can use this portable C version instead (a lot slower, unfortunately). rsbep is part of dvbackup, so some Debian users might already have it installed; my version, however, addresses some issues on the way to the goal we are seeking here, which is error resiliency for files against the common, bursty types of media errors. More information on what was changed is below.

The package is easily installed under Linux, Windows (Cygwin) and Free/Net/OpenBSD, with the usual

./configure 
make 
make install

Results

Here is a self-healing session in action:
home:/var/tmp/recovery$ ls -la
total 4108
drwxr-xr-x 2 ttsiod ttsiod    4096 2008-07-30 22:21 .
drwxrwxrwt 5 root   root      4096 2008-07-30 22:21 ..
-rw-r--r-- 1 ttsiod ttsiod 4194304 2008-07-30 22:21 data

home:/var/tmp/recovery$ freeze.sh data > data.shielded
home:/var/tmp/recovery$ ls -la
total 9204
drwxr-xr-x 2 ttsiod ttsiod    4096 2008-07-30 22:21 .
drwxrwxrwt 5 root   root      4096 2008-07-30 22:21 ..
-rw-r--r-- 1 ttsiod ttsiod 4194304 2008-07-30 22:21 data
-rw-r--r-- 1 ttsiod ttsiod 5202000 2008-07-30 22:21 data.shielded

home:/var/tmp/recovery$ melt.sh data.shielded > data2
home:/var/tmp/recovery$ md5sum data data2
9440c7d2ff545de1ff340e7a81a53efb  data
9440c7d2ff545de1ff340e7a81a53efb  data2

home:/var/tmp/recovery$ echo Will now create artificial corruption 
home:/var/tmp/recovery$ echo of 127 times 512 which is 65024 bytes

home:/var/tmp/recovery$ dd if=/dev/zero of=data.shielded bs=512 \
			    count=127 conv=notrunc
127+0 records in
127+0 records out
65024 bytes (65 kB) copied, 0,00026734 seconds, 243 MB/s

home:/var/tmp/recovery$ melt.sh data.shielded > data3

rsbep: number of corrected failures   : 64764
rsbep: number of uncorrectable blocks : 0

home:/var/tmp/recovery$ md5sum data data2 data3
9440c7d2ff545de1ff340e7a81a53efb  data
9440c7d2ff545de1ff340e7a81a53efb  data2
9440c7d2ff545de1ff340e7a81a53efb  data3
For those of you who don't speak UNIX, what you see above is a simple exercise in destruction: we "shield" a file with the freeze.sh script, which is part of my package; we then melt.sh the frozen file, and verify (through md5sum) that the newly generated file is exactly the same as the original one. We then proceed to deliberately destroy 64KB of the shielded file (that's a lot of consecutive sectors!), using dd to overwrite 127 sectors with zeros. We invoke melt.sh again, and we see that the newly generated file (data3) has the same MD5 sum as the original one - it was recovered perfectly.

Changeset from original rsbep

In case you are wondering why I had to modify rsbep, here's where my version differs from the original...
  • The original version wrote 3 parameters of Reed-Solomon as a single line before the "shielded" data, and this made the stream fragile (if this information was lost, decoding failed...)
  • It uses a default value of 16*255=4080 for parameter R, and it can thus tolerate the loss of 4080*16=65280 consecutive bytes anywhere in the stream, and still recover...
  • It adds file size information in the shielded stream, so the recovery process re-creates an exact copy of the original.
  • I added autoconf/automake support... I also created two packages, one for 32-bit x86 machines (which uses a fast assembly implementation), and another which is in plain C (and thus compiles and installs cleanly on many operating systems: Linux, Free/Net/OpenBSD, Windows (Cygwin/MinGW), etc).
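The numbers in this list can be cross-checked against the session shown earlier. The Python sketch below assumes that the shielded stream is written as whole frames of 255*R bytes, each carrying 223*R bytes of payload, and that the file size information fits into the padding of the last frame. Both are guesses about the format, not something taken from the rsbep source, and the name `shielded_size` is made up for this illustration.

```python
import math

BLOCK = 255     # shielded bytes per Reed-Solomon block
DATA = 223      # payload bytes per block
R = 4080        # default value of parameter R quoted in the list above
SECTOR = 512    # bytes per sector


def shielded_size(original_size: int) -> int:
    """Size of the shielded file, assuming output in whole frames of BLOCK * R bytes."""
    frame_payload = DATA * R    # 909840 payload bytes per frame
    frame_output = BLOCK * R    # 1040400 bytes written per frame
    frames = math.ceil(original_size / frame_payload)
    return frames * frame_output


# The 4194304-byte "data" file from the session:
print(shielded_size(4194304))    # 5202000, the size ls reported for data.shielded

# Longest burst the default settings can absorb, in bytes and in whole sectors:
max_burst = 16 * R
print(max_burst, max_burst // SECTOR)
```

The second print shows where the "127 consecutive sectors" figure comes from: 65280 bytes is a little over 127 sectors of 512 bytes, so 127 whole sectors is the largest aligned burst that still fits.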

Conclusion

These tools work fine for me, and I always use them when I back up data or move it around (e.g. from work to home). As an example, when I move my Git repository around, I always...
cd /path/to/mygit/
git gc 
cd ..
git clone --bare mygit mygit.bare 
tar jcpf mygit.tar.bz2 mygit.bare
freeze.sh mygit.tar.bz2 > /mnt/usbStick/mygit.tar.bz2.shielded
If you so wish, feel free to add a GUI layer over them... (I am a console kind of guy - I never code GUIs unless I really have to :-)
And yes, they have already saved my data a couple of times.
Last update on: Sun Sep 28 11:33:00 2008