Home > Archive > PERL Beginners > October 2006 > check for duplicate files that are truncated
check for duplicate files that are truncated
| Eric Waguespack 2006-10-06, 6:57 pm |
| Hi, I am new to the list, so I apologise if I do anything wrong :)
I made the script below to check files that are duplicates but I only
want to check the first few bytes (perhaps 1-10k, depending on false
positives)
I would like to convert it to native perl, can someone give me some pointers?
thanks
$ cat _md5_head.pl
#!/usr/bin/perl
# usage:
# find -type f -size +1000c -exec perl _md5_head.pl {} \; | sort -n | uniq -w32 -D
$file = shift @ARGV;
$hash = `head -c 1000 $file | md5sum | cut -f1 -d " "`;
chomp $hash;
print $hash, "\t", $file, "\n";
| |
| usenet@DavidFilmer.com 2006-10-06, 6:57 pm |
| Eric Waguespack wrote:
> I made the script below to check files that are duplicates but I only
> want to check the first few bytes (perhaps 1-10k, depending on false
> positives)
>
> I would like to convert it to native perl, can someone give me some pointers?
The IO::All module can do a lot of this for you... I once wrote a
demonstration example which just happens to include several aspects of
your requirements:
http://tinyurl.com/oxk9a
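For comparison, the shell pipeline in the original script (head, md5sum, cut) can also be written with core modules alone. What follows is a minimal sketch, not the IO::All demonstration linked above. The helper name head_md5 is made up for illustration, and the 1_000-byte length simply mirrors the `head -c 1000` in the original. Opening the file directly also sidesteps the unquoted `$file` in the backticks, which would break on names containing spaces or shell metacharacters.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper (not from the thread): MD5 hex digest of the
# first $len bytes of $file. Shorter files are hashed in full.
sub head_md5 {
    my ( $file, $len ) = @_;

    open my $fh, '<:raw', $file or die "Cannot open '$file': $!";

    # read returns undef on error and 0 at end of file.
    my $got = read( $fh, my $data, $len );
    die "Cannot read '$file': $!" unless defined $got;

    return md5_hex($data);
}

# Same output format as the original script: digest, tab, file name.
for my $file (@ARGV) {
    print head_md5( $file, 1_000 ), "\t", $file, "\n";
}
```

It should drop into the same `find ... -exec perl _md5_head.pl {} \; | sort -n | uniq -w32 -D` pipeline, since each output line still starts with the 32-character digest.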
--
The best way to get a good answer is to ask a good question.
David Filmer (http://DavidFilmer.com)
| |
| John W. Krahn 2006-10-06, 6:57 pm |
| Eric Waguespack wrote:
> Hi,
Hello,
> I am new to the list, so I apologise if I do anything wrong :)
>
> I made the script below to check files that are duplicates but I only
> want to check the first few bytes (perhaps 1-10k, depending on false
> positives)
>
> I would like to convert it to native perl, can someone give me some
> pointers?
>
> thanks
>
> $ cat _md5_head.pl
>
> #!/usr/bin/perl
> # usage:
> # find -type f -size +1000c -exec perl _md5_head.pl {} \; | sort -n | uniq -w32 -D
>
> $file = shift @ARGV;
> $hash = `head -c 1000 $file | md5sum | cut -f1 -d " "`;
> chomp $hash;
> print $hash, "\t", $file, "\n";
This is untested but it should work:
#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
use Digest::MD5;
use constant BUF_SIZ => 1_000;

my $dir = shift || '.';
my %files;    # digest of the first BUF_SIZ bytes => list of matching paths

find sub {
    stat;
    # Only regular files of at least BUF_SIZ bytes are considered.
    return unless -f _ and -s _ >= BUF_SIZ;
    open my $fh, '<:raw', $_ or do {
        warn "Cannot open '$File::Find::name' $!";
        return;
    };
    read $fh, my $data, BUF_SIZ or do {
        warn "Cannot read '$File::Find::name' $!";
        return;
    };
    push @{ $files{ Digest::MD5->new->add( $data )->hexdigest } },
        $File::Find::name;
}, $dir;

for my $chk_sum ( sort keys %files ) {
    next if @{ $files{ $chk_sum } } == 1;    # skip non-duplicates
    for my $file ( @{ $files{ $chk_sum } } ) {
        print "$chk_sum\t$file\n";
    }
}
__END__
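Because only the first BUF_SIZ bytes are hashed, each group printed by the script is a list of candidates, and false positives are possible. One way to confirm a pair is sketched below. It assumes that "truncated duplicate" means the shorter file is a byte-for-byte prefix of the longer one, which is a reading of the subject line rather than something stated in the thread. The name is_truncated_copy and the 65_536-byte chunk size are made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper (not from the thread). Returns 1 when the shorter of
# the two files is a byte-for-byte prefix of the longer one, 0 when the
# contents differ, and undef (or an empty list) on an open/stat/read error.
sub is_truncated_copy {
    my ( $file_a, $file_b ) = @_;

    open my $fh_a, '<:raw', $file_a or return;
    open my $fh_b, '<:raw', $file_b or return;

    # Compare only as many bytes as the shorter file contains.
    my $size_a = ( stat $fh_a )[7];
    my $size_b = ( stat $fh_b )[7];
    return unless defined $size_a and defined $size_b;
    my $remaining = $size_a < $size_b ? $size_a : $size_b;

    while ( $remaining > 0 ) {
        # Read in chunks so large files are never held in memory whole.
        my $want = $remaining < 65_536 ? $remaining : 65_536;
        my $got_a = read( $fh_a, my $buf_a, $want );
        my $got_b = read( $fh_b, my $buf_b, $want );
        return unless defined $got_a and defined $got_b;    # read error

        return 0
            unless $got_a == $want
            and $got_b == $want
            and $buf_a eq $buf_b;

        $remaining -= $want;
    }
    return 1;
}
```

Within one group, each pair of names in `@{ $files{ $chk_sum } }` could be passed to this helper before being reported. Raising BUF_SIZ reduces the number of candidate pairs at the cost of reading more of every file.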
John
--
Perl isn't a toolbox, but a small machine shop where you can special-order
certain sorts of tools at low cost and in short order. -- Larry Wall