File Management (PERL Beginners archive, July 2005)
| Joel Divekar 2005-07-24, 8:29 pm |
| Hi All
We have a Windows-based file server with thousands of
user accounts. Each user has thousands of files
in his home directory. Most of these files are
duplicates, or modified / updated versions, of
existing files. They are .doc, .xls
or .ppt files shared by groups or
departments.
Because of this my server holds terabytes of data, most
of which is redundant, and our sysadmin has a tough time
managing storage space.
To deal with this I thought of writing a small program to locate
similar or duplicate files stored on my file server
and delete them with the help of the user. The program
should work very fast, and I don't know where to
start.
Can anybody point me to some links
on how to start? I shall take it up from there. I
would also like to know of a long-term solution for this
problem, if there is one. I am comfortable with Linux and shell
programming.
Please advise. Thanks a lot.
Regards
Joel
Mumbai, India
9821421965
| |
| Xavier Noria 2005-07-24, 8:29 pm |
| On Jul 23, 2005, at 7:56, Joel Divekar wrote:
> We have a Windows-based file server with thousands of
> user accounts. Each user has thousands of files
> in his home directory. Most of these files are
> duplicates, or modified / updated versions, of
> existing files. They are .doc, .xls
> or .ppt files shared by groups or
> departments.
>
> Because of this my server holds terabytes of data, most
> of which is redundant, and our sysadmin has a tough time
> managing storage space.
>
> To deal with this I thought of writing a small program to locate
> similar or duplicate files stored on my file server
> and delete them with the help of the user. The program
> should work very fast, and I don't know where to
> start.
Well, to come up with the right solution one would need to play around a
bit on the server. I propose an approach based on the description
above, just in case it helps.
Since there is a large number of files, we need to walk the tree at least
once and store some data for each file to compare. I would start with a
quick test that keeps the tree traversal as fast as
possible and prunes the candidates, and only then run heavier operations on the
remaining ones.
For instance:
1. Walk the tree and build a map using -s
size -> filenames
2. Purge the entries that have just one filename associated, since
they have no duplicate for sure
3. Work on the rest of the entries.
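A minimal sketch of steps 1 and 2, assuming the core File::Find module does the walking. The sub name candidates_by_size and the choice to skip empty files and symbolic links are assumptions made for illustration, not part of the proposal above.

```perl
use strict;
use warnings;
use File::Find;

# Steps 1 and 2: map each size to the files of that size, then drop
# every size that was seen only once.
sub candidates_by_size {
    my @dirs = @_;
    my %by_size;
    find(
        {
            no_chdir => 1,    # $_ holds the full path of each entry
            wanted   => sub {
                return unless lstat $_;    # one system call; "_" reuses it
                return unless -f _;        # plain files only, links skipped
                my $size = -s _;
                return unless $size;       # assumption: ignore empty files
                push @{ $by_size{$size} }, $_;
            },
        },
        @dirs
    );
    # A size seen only once cannot belong to a duplicate.
    for my $size ( keys %by_size ) {
        delete $by_size{$size} if @{ $by_size{$size} } < 2;
    }
    return \%by_size;
}

# Runs only when directories are given on the command line.
if (@ARGV) {
    my $by_size = candidates_by_size(@ARGV);
    for my $size ( sort { $b <=> $a } keys %$by_size ) {
        print "$size bytes:\n";
        print "  $_\n" for @{ $by_size->{$size} };
    }
}
```

Nothing is printed during the walk itself; the report is produced once, after the map has been purged.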
If the map in (1) gets too big to fit in an in-memory hash, you could
use some sort of database table, maybe something simple to set up such as
SQLite. For (3), if the number of candidates is still not small, you
could refine further by building a map keyed on MD5 digests,
until you are left with a small number of files whose contents you can compare.
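The refinement for (3) might look like the sketch below, which assumes the core modules Digest::MD5 and File::Compare. The sub names md5_of and duplicate_sets are made up for this example, and the input is taken to be the files of one same-size group.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Compare qw(compare);

# MD5 hex digest of a file's contents, read in binary mode.
sub md5_of {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Refine one group of same-size files into sets with identical contents.
sub duplicate_sets {
    my @same_size = @_;
    my %by_md5;
    push @{ $by_md5{ md5_of($_) } }, $_ for @same_size;

    my @sets;
    for my $group ( grep { @$_ > 1 } values %by_md5 ) {
        # Confirm byte for byte against the first file of the group.
        # A file that shares the digest but differs from the first file
        # is simply left out of the set.
        my ( $first, @rest ) = @$group;
        my @same = grep { compare( $first, $_ ) == 0 } @rest;
        push @sets, [ $first, @same ] if @same;
    }
    return @sets;
}

# Runs only when file names are given on the command line.
if (@ARGV) {
    for my $set ( duplicate_sets(@ARGV) ) {
        print "Identical:\n";
        print "  $_\n" for @$set;
    }
}
```

Hashing only the files that survived the size test keeps the expensive reads to a minimum, and the final compare means a digest match alone is never reported as a duplicate.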
Trace the tree traversal as little as possible: printing a debug line
to the console for each file, for instance, would slow down the script
by orders of magnitude.
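If some feedback is still wanted, one option is to report only every N-th file. This is a small sketch; make_progress and the interval of 10000 are hypothetical choices, not something from the thread.

```perl
use strict;
use warnings;

# Returns a counter closure that writes a line to STDERR only once
# every $every calls, and returns the running count each time.
sub make_progress {
    my ($every) = @_;
    my $seen = 0;
    return sub {
        $seen++;
        print STDERR "$seen files scanned\n" if $seen % $every == 0;
        return $seen;
    };
}

# Create the counter once, then call $tick->() for each file visited.
my $tick = make_progress(10_000);
```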
Then, to maintain that tree, I don't know, maybe the time this takes
is acceptable? Running the procedure periodically might be a simple
but good enough solution.
-- fxn
| |
| Octavian Rasnita 2005-07-24, 8:29 pm |
| Also...
You can use the Digest::MD5 module to create an MD5 signature of each file and compare only the
files that have the same size.
Teddy
| |
| Tom Allison 2005-07-24, 8:29 pm |
| Joel Divekar wrote:
> For this I thought of writing a small program to locate
> similar or duplicate files stored on my file server
> and delete them with the help of the user. The program
> should work very fast and I don't know where to
> start.
File::Find is one possibility, except that it seems to behave badly when
files are modified while the tree is being walked. My experience
of 'badly' is duplicated results. Nothing worse, but something to be
aware of.
So you want to build a hash of FullPath => md5-hash
and then a second hash of md5-hash => [files]. If a key has more
than one filename associated with it, you probably want more
stat information (mtime) to decide which to purge.
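The two hashes and the mtime check could be sketched as follows. The sub names are invented for this example, and keeping the most recently modified copy is only an assumption; the real rule for what to purge is a policy decision.

```perl
use strict;
use warnings;

# Turn FullPath => md5-hash into md5-hash => [files], keeping only the
# digests shared by more than one file.
sub files_by_digest {
    my ($digest_of) = @_;
    my %files_for;
    for my $path ( sort keys %$digest_of ) {
        push @{ $files_for{ $digest_of->{$path} } }, $path;
    }
    for my $digest ( keys %files_for ) {
        delete $files_for{$digest} if @{ $files_for{$digest} } < 2;
    }
    return \%files_for;
}

# Pick one copy to keep from a list of identical files.
# Assumption: the newest mtime wins; ties are broken by path name.
sub keep_and_purge {
    my @files = @_;
    my %mtime;
    for my $file (@files) {
        my @st = stat $file or die "Cannot stat $file: $!";
        $mtime{$file} = $st[9];    # element 9 of stat() is the mtime
    }
    my ( $keep, @purge ) =
      sort { $mtime{$b} <=> $mtime{$a} or $a cmp $b } @files;
    return ( $keep, \@purge );
}
```

keep_and_purge only returns the candidates; nothing is deleted, so the list can be shown to the user first.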
This could probably be done in RAM if you are under 10^6 files.
Even if you can't hold the entire tree, you could at least do it in
chunks, for example by looking only at files within one size range at a time until you pare
things down a little.
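One way to read "in chunks" is a separate pass per size range, sketched here with File::Find. The sub name, the range boundaries, and the skipping of empty files are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use File::Find;

# Collect only the files whose size lies in [$min, $max); an undefined
# $max means "no upper limit". Sizes seen only once are dropped.
sub candidates_in_size_range {
    my ( $min, $max, @dirs ) = @_;
    my %by_size;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                return unless lstat $_;
                return unless -f _;
                my $size = -s _;
                return unless $size && $size >= $min;
                return if defined $max && $size >= $max;
                push @{ $by_size{$size} }, $_;
            },
        },
        @dirs
    );
    for my $size ( keys %by_size ) {
        delete $by_size{$size} if @{ $by_size{$size} } < 2;
    }
    return \%by_size;
}

# One walk of the tree per range: memory is bounded by the largest
# range, at the cost of repeated traversals.
if (@ARGV) {
    my @ranges = ( [ 1, 65_536 ], [ 65_536, 1_048_576 ], [ 1_048_576, undef ] );
    for my $range (@ranges) {
        my $by_size = candidates_in_size_range( @$range, @ARGV );
        printf "%d candidate sizes from %d bytes up\n",
          scalar( keys %$by_size ), $range->[0];
    }
}
```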