

Home > Archive > PERL Beginners > October 2006 > check for duplicate files that are truncated










 

Author check for duplicate files that are truncated
Eric Waguespack

2006-10-06, 6:57 pm

Hi, I am new to the list, so I apologise if I do anything wrong :)

I made the script below to check files that are duplicates but I only
want to check the first few bytes (perhaps 1-10k, depending on false
positives)

I would like to convert it to native perl, can someone give me some pointers?

thanks

$ cat _md5_head.pl

#!/usr/bin/perl
# usage:
# find -type f -size +1000c -exec perl _md5_head.pl {} \; | sort -n | uniq -w32 -D

$file = shift @ARGV;
$hash = `head -c 1000 $file | md5sum | cut -f1 -d " "`;
chomp $hash;
print $hash, "\t", $file, "\n";

usenet@DavidFilmer.com

2006-10-06, 6:57 pm

Eric Waguespack wrote:
> I made the script below to check files that are duplicates but I only
> want to check the first few bytes (perhaps 1-10k, depending on false
> positives)
>
> I would like to convert it to native perl, can someone give me some pointers?


The IO::All module can do a lot of this for you... I once wrote a
demonstration example which just happens to include several aspects of
your requirements:

http://tinyurl.com/oxk9a

--
The best way to get a good answer is to ask a good question.
David Filmer (http://DavidFilmer.com)

John W. Krahn

2006-10-06, 6:57 pm

Eric Waguespack wrote:
> Hi,


Hello,

> I am new to the list, so I apologise if I do anything wrong :)
>
> I made the script below to check files that are duplicates but I only
> want to check the first few bytes (perhaps 1-10k, depending on false
> positives)
>
> I would like to convert it to native perl, can someone give me some
> pointers?
>
> thanks
>
> $ cat _md5_head.pl
>
> #!/usr/bin/perl
> # usage:
> # find -type f -size +1000c -exec perl _md5_head.pl {} \; | sort -n | uniq -w32 -D
>
> $file = shift @ARGV;
> $hash = `head -c 1000 $file | md5sum | cut -f1 -d " "`;
> chomp $hash;
> print $hash, "\t", $file, "\n";
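
For a drop-in replacement of the one-file script that still fits the find | sort | uniq pipeline from the usage comment, a minimal sketch might look like this. It assumes Digest::MD5 is installed (it ships with Perl as a core module); head_md5 is a hypothetical helper name, not something from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# head_md5 is a hypothetical helper (name chosen for illustration).
# Returns the hex MD5 of the first $nbytes bytes of $path,
# or undef if the file cannot be opened or read.
sub head_md5 {
    my ( $path, $nbytes ) = @_;

    open my $fh, '<:raw', $path or do {
        warn "Cannot open '$path': $!\n";
        return undef;
    };

    # read returns undef on error, 0 at end of file
    my $got = read $fh, my $data, $nbytes;
    unless ( defined $got ) {
        warn "Cannot read '$path': $!\n";
        return undef;
    }
    close $fh;

    return md5_hex($data);
}

# Same output format as the shell version: checksum, tab, file name
for my $file (@ARGV) {
    my $hash = head_md5( $file, 1000 );
    print "$hash\t$file\n" if defined $hash;
}
```

Because the file name never passes through a shell, names containing spaces or quotes should not break it the way the backtick version can.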


This is untested but it should work:

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
use Digest::MD5;

# Number of leading bytes to checksum
use constant BUF_SIZ => 1_000;

my $dir = shift || '.';

# Maps checksum => list of file names whose first BUF_SIZ bytes share it
my %files;
find sub {
    stat;
    # Only consider plain files of at least BUF_SIZ bytes
    return unless -f _ and -s _ >= BUF_SIZ;

    open my $fh, '<:raw', $_ or do {
        warn "Cannot open '$File::Find::name': $!";
        return;
    };

    read $fh, my $data, BUF_SIZ or do {
        warn "Cannot read '$File::Find::name': $!";
        return;
    };

    push @{ $files{ Digest::MD5->new->add( $data )->hexdigest } },
        $File::Find::name;

}, $dir;


for my $chk_sum ( sort keys %files ) {
    next if @{ $files{ $chk_sum } } == 1; # skip non-duplicates
    for my $file ( @{ $files{ $chk_sum } } ) {
        print "$chk_sum\t$file\n";
    }
}

__END__
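
Since only the first bytes are checksummed, files that share a checksum are only candidates, which is the false-positive concern raised in the original post. One way to confirm them, as a sketch rather than part of the script above, is a full comparison with the core File::Compare module. confirm_duplicates is a hypothetical helper name.

```perl
use strict;
use warnings;
use File::Compare qw(compare);

# confirm_duplicates is a hypothetical helper (name chosen for illustration).
# Takes paths that share a prefix checksum and returns a list of array refs,
# each holding two or more paths whose full contents are identical.
sub confirm_duplicates {
    my @paths = @_;
    my @groups;

    PATH: for my $path (@paths) {
        for my $group (@groups) {
            # compare() returns 0 if the files are equal,
            # 1 if they differ, and -1 on error
            if ( compare( $group->[0], $path ) == 0 ) {
                push @$group, $path;
                next PATH;
            }
        }
        push @groups, [$path];    # first file seen with this content
    }

    return grep { @$_ > 1 } @groups;
}
```

It could be called once per checksum bucket, for example confirm_duplicates( @{ $files{ $chk_sum } } ), printing each returned group instead of the raw bucket.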



John
--
Perl isn't a toolbox, but a small machine shop where you can special-order
certain sorts of tools at low cost and in short order. -- Larry Wall






