Home > Archive > Compression > July 2006 > Use of error correcting codes for file repair
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Use of error correcting codes for file repair
|
|
| giorgio.tani@email.it 2006-07-10, 7:55 am |
| Hi, I'm looking at the documentation of some error correcting code
algorithms and I'm wondering about how practically use it for file
repair, as some programs do.
As for I could understand, it's not computationally an easy task and
there is no real guarantee of success in all cases (even if robust
schemes can give good results in many cases), since the unavoidable
losslesness of the information in the n-bit ecc compared to the m-bit
message, so the ecc space is bound to the *expected* error ratio (i.e.
n-bit tag of ecc is used when n wrong bit are expected in the message
of m bit).
Many schemes break the message in smaller blocks and calculate the
check for each block, in order to reduce the complexity of error
recovery.
I'm right since now or I'm missing something?
Now, given those things, we have that many (if not nearly all)
filesystems, low level storage schemes and communication protocols have
a ECC scheme, oriented on work on small batch of data giving an
overhead (transparent to the user at an higher abstraction level) that
makes sense for the given protocol (i.e. proportional to the probable
ratio of phisical failures of the disk or of the communication
channel).
My question is: does it makes sense build a ECC scheme at an higher
abstraction level, in example defining a file format (or an archive
format) or we should rely on lower level ECC without worrying?
AFAIK it would be very time consuming thing try to reverse a CRC on
even a buffer of few KB of data, and otherwise it would be quite costly
(as in time as in overhead, that will sum, in the case of a file, to
the overheads still introduced by phisical disk schemes and by the
filesystem) doing it on few bits in order to give to the file format
the capability to quickly recover from data corruption.
In other worlds, I'm wondering if my conclusion are wrong or are
somewhat right:
- error checking/correction should be done at the lower abstraction
level possible, working on small quantity of data, like on disks and
networking protocols, allowing quick and transparent error recovery;
- error checking/correction shouldn't be done at high level of
abstraction, i.e. defining file format, since it will be less
functional (introducing utter overhead, or being less reliable and more
time consumning increasing the size of segment checked), since they
cannot substitute lower level error checking/correction since each
communication channel could have a different error ratio, different
error patterns and so have different optimal error checking/correcting
schemes, and finally because after all it can't fully substitute the
role of the backup.
On the other size, the better flexibility and the biggest memory
resource that can be used working at higer level (i.e. a software
running on a personal instead that on a resource-limited network
apparatus or on a onboard disk logic) makes more sense to delegate to
this level different kinds of checking, like checksum, MAC
authenticated encryption and so on.
Could someone please confirm if my thoughts are somewhat right or not?
Thank you in advance for any answer.
| |
| Mark Adler 2006-07-10, 6:55 pm |
| giorgio.tani@email.it wrote:
> My question is: does it makes sense build a ECC scheme at an higher
> abstraction level, in example defining a file format (or an archive
> format) or we should rely on lower level ECC without worrying?
It is of course possible and not uncommon for the damage to be too
great for the lower level ECCs to be able to correct the data, in which
case a higher level ECC can provide additional protection. Such
"concatenated" codes are common in applications with a requirement for
very low error rates. Furthermore, from what I've seen with respect to
queries about how to recover corrupted zlib data (usually you can't),
there appear to be ways to introduce errors in the data due to
unintended deterministic, but non-reversible conversions.
So, depending on your level of paranoia and the importance of the data,
higher level ECCs can make sense. It is also important, especially for
compression formats, to contain the errors to allow for partial
recovery by limiting the extent of history dependence.
mark
| |
| giorgio.tani@email.it 2006-07-10, 6:55 pm |
| Thanks for the answer! If I may, I've some other doubts that comes to
my mind:
> It is of course possible and not uncommon for the damage to be too
> great for the lower level ECCs to be able to correct the data, in which
> case a higher level ECC can provide additional protection.
Analyzing wider amount of data with a good ECC scheme make more
probable to find errors, the error checking part of the system is
enhanced by the fact of working on more input data.
But the error correcting part become (afaik, but obviusly I can be
wrong) increasingly difficult as the input size increase, so my doubt
is: what are the feasibility limits in this field?
Are there practical limits (some hundreds of bytes? some KB? some
MB?...) for the data to ECC at once to obtain a reasonable ease of
finding the needed correction?
Is known a limit of size of the input when is computationally
reasonable to drop the *correction* side of the ECC and stay on the
*checking* side, maybe replacing the checksum or CRC with something
more robust to identify complex modification (and even malicious
attacks) like an hash or better an hmac or authenticated encryption and
whenever possible trying to ask for resending the data rather than
trying to correct it with the rather small ECC information (compared to
the width of the message)?
> It is also important, especially for
> compression formats, to contain the errors to allow for partial
> recovery by limiting the extent of history dependence.
This make me recalling a recent thread about random access to archive
content, that may be useful for a practical example of my doubts:
http://groups.google.it/group/comp....62f1bef92ddd0cc
In a sistem similar to the one I describe, basically zcompress on
slices of some KB of uncomperessed input, saving the compressed size of
those slices in order to allow the user (that know where the data were
in the original uncompressed file) to surf the file to the block
containing the desired data without having to uncompress all the data,
would be useful to implement, say, a CRC32 on each block in order to
allow error correction?
My doubt is, if even a CRC32 may be enough good to provably recognize
many known cases of errors on some KB of data, would it be good to
allow, with his tiny overhead of 4 byte of information, to allow an
efficient (computationally feasible, with reasonably high success
ratio) error correction of the block?
(However I suspect that many of the answers basically may be reduce to:
if the data can be resend (or if exist a backup of the file), it's
usually more efficient avoiding the correction attempt and discard the
corrupted data, if it's not feasible to resend (or if there are no
backup copy of the file), the correction attempt is the only last
resort to have the data...)
| |
| Mark Adler 2006-07-10, 6:55 pm |
| giorgio.tani@email.it wrote:
> Analyzing wider amount of data with a good ECC scheme make more
> probable to find errors, the error checking part of the system is
> enhanced by the fact of working on more input data.
No, actually you'd rather have smaller chunks of data relative to the
amount of ECC information to get better correction performance.
> Are there practical limits (some hundreds of bytes? some KB? some
> MB?...) for the data to ECC at once to obtain a reasonable ease of
> finding the needed correction?
You need to read about Reed-Solomon codes. To first order, adding 2n
bytes of ECC data allows you to correct up to n bytes of errors
anywhere in the stream, or correcting up to 2n bytes of erasures of
known locations.
> maybe replacing the checksum or CRC with something
> more robust to identify complex modification (and even malicious
> attacks) like an hash or better an hmac or authenticated encryption
Correcting errors and authenticating the data are two completely
different objectives, and solved by two different approaches. Error
correction requires solvable (usually linear) equations to restore the
data, whereas authentication is exactly the opposite requiring as
irreversible a set of equations as possible to assure that a different
set of input cannot be constructed that has the same signature. So
what you do is apply a signature, and then apply the ECC coding. Then
on the other end you can check and correct the data, and then verify
the authentication.
> would be useful to implement, say, a CRC32 on each block in order to
> allow error correction?
CRCs cannot enable error correction on any but the most trivial cases
(i.e. a stream equal to or shorter than the length of the CRC itself).
CRCs are for error detection, not correction.
mark
| |
| giorgio.tani@email.it 2006-07-10, 6:55 pm |
| > CRCs cannot enable error correction on any but the most trivial cases
> (i.e. a stream equal to or shorter than the length of the CRC itself).
> CRCs are for error detection, not correction.
Thank you for the detailed answer, in facts many of the doubts on ECC I
had were born from observing file recovery applications that works on,
i.e., zip or gzip files that does "recovery" of corrupted archives.
Since those formats features a CRC32 tag at the end of each file I was
fooled thinking that those application would try to analyze the
corupted files and the corresponding CRC and try to correct the file,
but actually as for I can see they don't do that.
Observing more in details the operation of those softwares it seem that
they mainly discard files with wrong CRC and save files with correct
CRC to a new archive.
In fact I was fooled because it's a totally different thing than trying
to recover the corrupted data using the CRCs, so I was wondering under
what limits was it feasible since CRC doesn't bring a seizeable amount
of information and are not so fit to be reversed for error correction,
but now I see that it's a non issue mainly that softwares basically
don't try to do it and as you confirm to me it's quite unfeasible to
try to correct seizeable amount of data given it's CRC and different
agorithms, more fit for error correction, should be used to feature a
true error recovery of the actually corrupted data.
I read an interesting chapter about them on MacKay's Information Theory
book and Wikipedia has IMHO an interesting section about them, however,
if I may ask, where could I find some example implementations of them
(in C or better in Delphi/Pascal) applied for error recovery from
files?
Thank you again for the kind and detailed answers.
Giorgio
| |
|
| look up reed soloman cross interleave code
| |
| John Reiser 2006-07-10, 6:55 pm |
| Mark Adler wrote:
> giorgio.tani@email.it wrote:
>
>
> You need to read about Reed-Solomon codes. To first order, adding 2n
> bytes of ECC data allows you to correct up to n bytes of errors
> anywhere in the stream, or correcting up to 2n bytes of erasures of
> known locations.
Also learn about general Viterbi codes. You can make the Hamming
distance between valid code words as large as you want (the output
expands by a constant factor which increases as the Hamming distance
increases), and still be able to decode noisy input uniquely.
The encoding is fast, the decoding is slow in proportion to the
actual noise (takes many passes in a feedback loop).
--
| |
| giorgio.tani@email.it 2006-07-11, 3:55 am |
| > > You need to read about Reed-Solomon codes. To first order, adding 2n
> Also learn about general Viterbi codes. You can make the Hamming
> distance between valid code words as large as you want (the output
> expands by a constant factor which increases as the Hamming distance
> increases), and still be able to decode noisy input uniquely.
> The encoding is fast, the decoding is slow in proportion to the
> actual noise (takes many passes in a feedback loop).
Thank you all for your answers.
I apologize because of I begun the tread with some confusion about data
recovery, since I was after looking at some archive repair
utilities: my confusion was due to the fact that they work on format
that doesn't contain proper error recovery data, but finally I got that
they actually doesn't do a data recovery in the sense of repairing
corrupted data but basically identify and discard corrupted data and
write correct data to a new archive (in that sense, they do what they
promise and repair the archive, allowing to recover as much content as
possible, but actually they don't repair the corrupted section of the
archive, as I wrongly understood firstly).
However, a very interesting thread come out and I've some interesting
topics to look at, so thank you all for having taken time to answer.
| |
|
| On 2006-07-10, giorgio.tani@email.it <giorgio.tani@email.it> wrote:
> My question is: does it makes sense build a ECC scheme at an higher
> abstraction level, in example defining a file format (or an archive
> format)
This is what PAR is.
It uses something called a Reed-Solomon code
Bye.
Jasen
| |
|
| On 2006-07-10, giorgio.tani@email.it <giorgio.tani@email.it> wrote:
> Is known a limit of size of the input when is computationally
> reasonable to drop the *correction* side of the ECC and stay on the
> *checking* side, maybe replacing the checksum or CRC with something
> more robust to identify complex modification (and even malicious
> attacks) like an hash or better an hmac or authenticated encryption and
> whenever possible trying to ask for resending the data rather than
> trying to correct it with the rather small ECC information (compared to
> the width of the message)?
as size increases more correction coding is indicated as the probability of
corruption also increases
> This make me recalling a recent thread about random access to archive
> content, that may be useful for a practical example of my doubts:
> http://groups.google.it/group/comp....62f1bef92ddd0cc
> In a sistem similar to the one I describe, basically zcompress on
> slices of some KB of uncomperessed input, saving the compressed size of
> those slices in order to allow the user (that know where the data were
> in the original uncompressed file) to surf the file to the block
> containing the desired data without having to uncompress all the data,
> would be useful to implement, say, a CRC32 on each block in order to
> allow error correction?
CRC32 only allows error dectection,
> My doubt is, if even a CRC32 may be enough good to provably recognize
> many known cases of errors on some KB of data, would it be good to
> allow, with his tiny overhead of 4 byte of information, to allow an
> efficient (computationally feasible, with reasonably high success
> ratio) error correction of the block?
what sizeo of error do you want to protect against, how much bad data must
be tolerated before the correction scheme is allowed to fail?
> (However I suspect that many of the answers basically may be reduce to:
> if the data can be resend (or if exist a backup of the file), it's
> usually more efficient avoiding the correction attempt and discard the
> corrupted data, if it's not feasible to resend (or if there are no
> backup copy of the file), the correction attempt is the only last
> resort to have the data...)
yeah that works too, a backup copy is a kind of error correction,
it's mirroring.
Bye.
Jasen
| |
| Willem 2006-07-12, 3:55 am |
| jasen wrote:
) On 2006-07-10, giorgio.tani@email.it <giorgio.tani@email.it> wrote:
)
)> My question is: does it makes sense build a ECC scheme at an higher
)> abstraction level, in example defining a file format (or an archive
)> format)
)
) This is what PAR is.
)
) It uses something called a Reed-Solomon code
Well, technically it doesn't. :-) It uses something that is quite
similar, and a lot simpler to program, but fails every once in a while.
(Fails as in it needs more than N repair packets to repair N lost packets.)
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
| cr88192 2006-07-13, 3:55 am |
|
<giorgio.tani@email.it> wrote in message
news:1152535077.661879.179720@m73g2000cwd.googlegroups.com...
> Hi, I'm looking at the documentation of some error correcting code
> algorithms and I'm wondering about how practically use it for file
> repair, as some programs do.
<snip>
>
maybe usable (simple but expensive wrt data inflation):
the data is broken into blocks.
you store 2 data blocks, and a 3rd block consisting of the first 2 xor'ed
together.
now, assuming you have any 2 of the 3, it is possible to recover whatever
data was present in the missing block. right here is about a 1/3 overhead.
for added safety, one can xor some of the other blocks together to allow
forther protection for those blocks. eg, every 6 blocks, is another block
containing the xor or the last 2 xor'ed blocks.
DDPDDPX, where D is data, P is parity, and X is xor'ed parity (allowing
recovery if both 1 data and a parity block are bad for the same set of 3,
but one is still SOL if both data blocks are bad). this is 3/4 inflation.
or DDPDDPOE, where O is the xor of the odd blocks, and E the xor of the even
(allowing recovery if both data blocks and a parity in a set of 3 are
corrupted, so long as at least 2 of the other, and the O/E blocks remain).
this is 2x the original size.
or, DDDDOE, 50% inflation (2 consecutive, but fails if 2 even or 2 odd are
bad).
of course, for any of these, one needs to be able to tell which blocks are
bad.
this would mandate computing checksums to detect the bad blocks, and
possibly every so often storing a block full of checksums (these could be
protected similarly to allow making sure things are correct).
alternatively the checksum could be stored in each block.
additionally, if one seperates the blocks over a larger area (say, all the
odd blocks procede sequentially, all the even blocks are in reverse, and all
the parity blocks start from the middle and work twards the ends or some
other order). then the format is made much more immune to any localized
corruption.
this combined with the original scheme and storing checksums somewhere I
think would be sufficient...
or such...
| |
| Ignorant 2006-07-13, 6:55 pm |
|
Check with standards of CD recording. They have a lot of error
correction built in (?)
as it is a very error prone state of the mart media technology.
cr88192 wrote:
> <giorgio.tani@email.it> wrote in message
> news:1152535077.661879.179720@m73g2000cwd.googlegroups.com...
> <snip>
>
> maybe usable (simple but expensive wrt data inflation):
> the data is broken into blocks.
> you store 2 data blocks, and a 3rd block consisting of the first 2 xor'ed
> together.
>
> now, assuming you have any 2 of the 3, it is possible to recover whatever
> data was present in the missing block. right here is about a 1/3 overhead.
>
> for added safety, one can xor some of the other blocks together to allow
> forther protection for those blocks. eg, every 6 blocks, is another block
> containing the xor or the last 2 xor'ed blocks.
> DDPDDPX, where D is data, P is parity, and X is xor'ed parity (allowing
> recovery if both 1 data and a parity block are bad for the same set of 3,
> but one is still SOL if both data blocks are bad). this is 3/4 inflation.
>
> or DDPDDPOE, where O is the xor of the odd blocks, and E the xor of the even
> (allowing recovery if both data blocks and a parity in a set of 3 are
> corrupted, so long as at least 2 of the other, and the O/E blocks remain).
> this is 2x the original size.
>
> or, DDDDOE, 50% inflation (2 consecutive, but fails if 2 even or 2 odd are
> bad).
>
> of course, for any of these, one needs to be able to tell which blocks are
> bad.
> this would mandate computing checksums to detect the bad blocks, and
> possibly every so often storing a block full of checksums (these could be
> protected similarly to allow making sure things are correct).
>
> alternatively the checksum could be stored in each block.
>
>
> additionally, if one seperates the blocks over a larger area (say, all the
> odd blocks procede sequentially, all the even blocks are in reverse, and all
> the parity blocks start from the middle and work twards the ends or some
> other order). then the format is made much more immune to any localized
> corruption.
>
> this combined with the original scheme and storing checksums somewhere I
> think would be sufficient...
>
>
> or such...
| |
| giorgio.tani@email.it 2006-07-14, 7:55 am |
| Afaik the scheme you proposed is feasible, simple to implement and
quite fast to run both for generating error recovey data and for
attempting data repair, but the problem using parity as error
correcting strategy is firstlyt that it impose a seizeable overhead, as
you had adviced in your post.
Now I'm about learning more on Reed-Solomon and other more modern ECC
and trying how to implement it with Delphi/Freepascal (I was not able
to find many examples of code in that language, if someone may suggest
me a usable library it will be very appreciate) in order to try to do
somerhing similar to RAR recovery records (by the way, afaik RAR times
ago used parity for error recovery but replaced that scheme with a more
modern one based on ECC... I woluld like to do something similar or
find a library that offer primitives to perform such a function).
| |
| Mark Adler 2006-07-14, 6:55 pm |
| Ignorant wrote:
> Check with standards of CD recording. They have a lot of error
> correction built in (?)
> as it is a very error prone state of the mart media technology.
They use Reed-Solomon codes. Interleaved R-S codes are particularly
good, as compared to other codes, at long bursts of errors, which is
what you get from a scratch.
mark
|
|
|
|
|