
 

Author checking for corrupted files
Seb

2008-01-29, 7:01 pm

Hi,

I would like to check if some simple text files have been corrupted. A
manual/visual check of a few of the files shows that some of them
contain "garbage" characters in them. I can't directly "see" what those
characters are, but they can be found at any part of the file. The
information I'm after is simply the name of the file that is corrupted.
So I thought the following would do:

awk '/[^[:alnum:]]/ {print FILENAME}' *

Since, if IIUC, [:alnum:] represents all alphabet letters (upper and
lower case) and all digits, punctuation marks and symbols, which are
part of the uncorrupted files. Basically, print the file name if a line
contains any character that is NOT one of these. Is this a good way to
spot those corrupted files?


Cheers,

--
Seb

Ed Morton

2008-01-29, 10:03 pm



On 1/29/2008 4:09 PM, Seb wrote:
> Hi,
>
> I would like to check if some simple text files have been corrupted. A
> manual/visual check of a few of the files shows that some of them
> contain "garbage" characters in them. I can't directly "see" what those
> characters are, but they can be found at any part of the file. The
> information I'm after is simply the name of the file that is corrupted.
> So I thought the following would do:
>
> awk '/[^[:alnum:]]/ {print FILENAME}' *
>
> Since, if IIUC, [:alnum:] represents all alphabet letters (upper and
> lower case) and all digits, punctuation marks and symbols,


alnum = ALpha NUMeric, i.e. alphabetic and numeric characters, no punctuation
marks, symbols or anything else.
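
A quick way to see the effect, as a rough sketch: the three sample lines below are made up, and the variable name alnum_hits is only for illustration. Because a space or a period is not alphanumeric, the original pattern matches nearly every line of ordinary text, so it would flag nearly every file.

```shell
# Illustration only: the sample input lines are invented.
# [^[:alnum:]] matches any single character that is not a letter or digit,
# so a space or a period is enough to make a line match.
alnum_hits=$(printf 'hello world\nabc123\nend.\n' |
  LC_ALL=C awk '/[^[:alnum:]]/ {print NR}')

# Print the numbers of the lines that matched.
echo "$alnum_hits"
```

Only the middle line, which is purely letters and digits, escapes the match.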

> which are
> part of the uncorrupted files. Basically, print the file name if a line
> contains any character that is NOT one of these. Is this a good way to
> spot those corrupted files?


Beats me since I don't know what your files can legally contain and so don't
know what it means for them to be "corrupted", but to detect control characters
you'd use the "[:cntrl:]" character class:

awk '/[[:cntrl:]]/ {print FILENAME}' *

Note though, that the presence of control characters doesn't always mean a
file's been corrupted (e.g. people use control-Ls to separate functions in
source code to force form-feed to printers). Maybe some other character class or
combination of such would be more appropriate. See
http://www.gnu.org/software/gawk/ma...har_002dclasses for
the list.
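
Two practical wrinkles with the one-liner above: it prints the file name once per matching line, and a tab is itself a control character, so tab-separated files would be flagged. A possible variation is sketched below, assuming tabs are legitimate in your files. The temporary directory, the sample files and the names tmpdir, flagged and seen are hypothetical, chosen only for the demonstration.

```shell
# Sketch only: the temporary directory and sample files are hypothetical.
tmpdir=$(mktemp -d)
printf 'plain text\twith a tab\n' > "$tmpdir/clean.txt"
printf 'bad \001 byte\nmore \002 junk\n' > "$tmpdir/dirty.txt"

# Strip tabs from a copy of each line, then test what is left for control
# characters. The seen[] array keeps each file name from being printed
# more than once.
flagged=$(cd "$tmpdir" && LC_ALL=C awk '
  { line = $0; gsub(/\t/, "", line) }   # assumption: tabs are allowed
  line ~ /[[:cntrl:]]/ && !seen[FILENAME]++ { print FILENAME }
' *)

echo "$flagged"
rm -rf "$tmpdir"
```

Whether tabs, form feeds or anything else should be exempt depends on what the files may legally contain, so adjust the gsub() pattern to suit.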

Ed.

