For Programmers: Free Programming Magazines  


Home > Archive > Unix Programming > November 2004 > Unix C programming for finding file









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Unix C programming for finding file
dawn

2004-11-24, 4:01 pm

Hi all,
I am now working on a C program under Unix.
The requirement for the program is that:
A file name is passed to program as a parameter. The program will
Find files under a specified directory. The matched file must have the
same content with the given file. It does not matter whether the
filenames are the same.

It is easy to find file that has the same name with given file, but
may be hard to find the files that with the same content. In my
knowledge, i give two solutions:

1) Go throught the directory and its sub direcotry tree, and when
meeting a file, Use the stand C library function to open that file and
the given file, and then compare those contents in buffers to see if
they are the same.

2) Go through the directory and its sub deirectroy tree, and when
meeting a file, execute the system shell command "diff" to compare the
two files to see if they are the same in content.


The two solutions seems not very smart and they are very running
slowly. I wonder if there is any library function that can compare two
file contents just like "strcmp" to compare two string.
Or maybe there are some other smart ways to achieve it.

Thank you for suggestions.
Måns Rullgård

2004-11-24, 4:01 pm

dawn <ming@lian.com> writes:

> Hi all,
> I am now working on a C program under Unix.
> The requirement for the program is that:
> A file name is passed to program as a parameter. The program
> will Find files under a specified directory. The matched file
> must have the same content with the given file. It does not
> matter whether the filenames are the same.


find dirname -type f -exec cmp -s filename {} \; -print

--
Måns Rullgård
mru@inprovide.com
Bjørn Augestad

2004-11-24, 4:01 pm

dawn wrote:

> Hi all,
> I am now working on a C program under Unix.
> The requirement for the program is that:
> A file name is passed to program as a parameter. The program will
> Find files under a specified directory. The matched file must have the
> same content with the given file. It does not matter whether the
> filenames are the same.
>
> It is easy to find file that has the same name with given file, but
> may be hard to find the files that with the same content. In my
> knowledge, i give two solutions:
>
> 1) Go throught the directory and its sub direcotry tree, and when
> meeting a file, Use the stand C library function to open that file and
> the given file, and then compare those contents in buffers to see if
> they are the same.
>
> 2) Go through the directory and its sub deirectroy tree, and when
> meeting a file, execute the system shell command "diff" to compare the
> two files to see if they are the same in content.
>
>
> The two solutions seems not very smart and they are very running
> slowly. I wonder if there is any library function that can compare two
> file contents just like "strcmp" to compare two string.
> Or maybe there are some other smart ways to achieve it.



You can't use strcmp(), but you can use memcmp(). All you have to do is
to memory map (man mmap) the two files and then call memcmp().

HTH
Bjørn



>
> Thank you for suggestions.

dawn

2004-11-24, 4:01 pm

Måns Rullgård wrote:
> dawn <ming@lian.com> writes:
>
>
>
>
> find dirname -type f -exec cmp -s filename {} \; -print
>

It is a shell command, Is it possible to implement it using C language?
dawn

2004-11-24, 4:01 pm

Bjørn Augestad wrote:
> dawn wrote:
>
>
>
>
> You can't use strcmp(), but you can use memcmp(). All you have to do is
> to memory map (man mmap) the two files and then call memcmp().
>

But if a file is very large, loading the whole file to memory may be
very uncontrolable. If use a buffer, it should use loops to compare the
contents. It should be very slow to process the finding.
[color=darkred]
> HTH
> Bjørn
>
>
>
Måns Rullgård

2004-11-24, 4:01 pm

dawn <ming@lian.com> writes:

> Måns Rullgård wrote:
> It is a shell command, Is it possible to implement it using C language?


Certainly, the "find" command is written in C. But why reinvent the
wheel, unless it's homework.

--
Måns Rullgård
mru@inprovide.com
dawn

2004-11-24, 4:01 pm

Måns Rullgård wrote:
> dawn <ming@lian.com> writes:
>
>
>
>
> Certainly, the "find" command is written in C. But why reinvent the
> wheel, unless it's homework.
>

It is really a homework. I am a colledge student of Computer Science.
This programming is homework of the class APUE (Advanced Programming in
the Unix Environment). Our teacher tell us to implement it in C
language, i just can't find a convenient way to achieve it.
Stephane CHAZELAS

2004-11-24, 4:01 pm

2004-11-24, 23:27(+08), dawn:
> I am now working on a C program under Unix.
> The requirement for the program is that:
> A file name is passed to program as a parameter. The program will
> Find files under a specified directory. The matched file must have the
> same content with the given file. It does not matter whether the
> filenames are the same.
>
> It is easy to find file that has the same name with given file, but
> may be hard to find the files that with the same content. In my
> knowledge, i give two solutions:

[...]

(note that to walk a directory tree, there's the standard C
library function nftw(3))

You may compute a checksum of the files and compare only the
checksums, that would prevent having to read the reference file
many times.

But that would not be much of a trouble as that file would be in
the cache, and you'd need to read the whole files before be able
to stat wether files are different (while with the basic
comparison, you can stop at the first difference found).

However the checksum solution would reveal more efficient if
there are several files to look after (you would then store and
reuse the checksum list).

Note that cmp is better than diff. It just checks that the file
are the same or not, it doesn't try to find out what the
differences are. Moreover, with -s, it may start with comparing
the file sizes and not even compare the content.

--
Stephane
Nick Landsberg

2004-11-24, 4:01 pm

dawn wrote:

> M=E5ns Rullg=E5rd wrote:
>=20
e?[color=darkred]
> It is really a homework. I am a colledge student of Computer Science.=20
> This programming is homework of the class APUE (Advanced Programming in=

=20
> the Unix Environment). Our teacher tell us to implement it in C=20
> language, i just can't find a convenient way to achieve it.


OK, some hints which hopefully do not wind up doing your
homework for you ---

There are two parts to this problem:
- directory traversal
- comparison

To do the directory traversal, see the manpage(s) for
opendir() and readdir(). Solve this as a separate
problem from the comparison problem. Make sure you can
go down to some reasonable level of sub-directories,
(possibly using recursion).

As for the comparison, there are other things than
file contents which you can use to minimize the
amount of time you spend actually comparing files. See the
man page for stat() and lstat(). What information
returned by these calls can you use to determine
definitively that the file is different? Once you
figure that out, you are left with (hopefully) comparing
only a handful of files.

Past that point, you're on your own.

Hope this helps.

NPL


--=20
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch
Jens.Toerring@physik.fu-berlin.de

2004-11-24, 4:01 pm

dawn <ming@lian.com> wrote:
> Bjørn Augestad wrote:
> But if a file is very large, loading the whole file to memory may be
> very uncontrolable. If use a buffer, it should use loops to compare the
> contents. It should be very slow to process the finding.


You can only avoid loading all files into memory if you compare
sizes before you do the real comparison, that's a cheap operation
and files with different lengths obviously can't be identical.
After that test you still have to load the remaining files to be
able to compare them - how are you supposed to compare them other-
wise without looking at their contents? Of course, they may not
need to be loaded completely into memory if they differ already
near to the start of the files (but mmap() probably won't load
them in toto but just as much as is currently needed). Your best
bet is probably to mmap() files that have the same length as the
one you're comparing against and than use memcmp() on chunks of
(multiples of) the machines page size. Don't try to compare them
byte by byte in a loop - memcmp() is probably much more effective
in most non-pathological cases.

If you want to get around the necessity to load file of equal length
and the files hardly ever change you could get away with storing a
hash value for all of the files on the disk. For this you have to
open all the files once and do some computations, but then you only
have to take files with equal hash values into consideration later
on - and these are extremely likely to be identical.

Regards, Jens
--
\ Jens Thoms Toerring ___ Jens.Toerring@physik.fu-berlin.de
\__________________________ http://www.toerring.de
Roger Leigh

2004-11-24, 4:01 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

dawn <ming@lian.com> writes:

> Bjørn Augestad wrote:
> But if a file is very large, loading the whole file to memory may be
> very uncontrolable. If use a buffer, it should use loops to compare
> the contents. It should be very slow to process the finding.


mmap() maps the file into memory. It will be far more efficient than
reading into a buffer, since there's no memory allocation involved
(you can map the entire file, or portions if you choose). If you read
up on the differences between read() and mmap(), you'll see why.


Regards,
Roger

- --
Roger Leigh
Printing on GNU/Linux? http://gimp-print.sourceforge.net/
Debian GNU/Linux http://www.debian.org/
GPG Public Key: 0x25BFB848. Please sign and encrypt your mail.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>

iD8DBQFBpLb1VcFcaSW/ uEgRAspjAKDF5aroJfH6HrrJ8+JD+21anNC0NwCg
qHiQ
4zT/QdkAaDxXrl4MtUzdw/U=
=Ivw6
-----END PGP SIGNATURE-----
Randy Howard

2004-11-24, 4:01 pm

In article <bx2pd.966892$Gx4.804330@bgtnsc04-news.ops.worldnet.att.net>,
SPAMhukolauTRAP@SPAMworldnetTRAP.att.net says...
> OK, some hints which hopefully do not wind up doing your
> homework for you ---


[snip]

> As for the comparison, there are other things than
> file contents which you can use to minimize the
> amount of time you spend actually comparing files. See the
> man page for stat() and lstat().



Why not just write the code for him? If that isn't enough of
a hint... :-)

--
Randy Howard (2reply remove FOOBAR)
Måns Rullgård

2004-11-24, 4:01 pm

Nick Landsberg <SPAMhukolauTRAP@SPAMworldnetTRAP.att.net> writes:

> As for the comparison, there are other things than
> file contents which you can use to minimize the
> amount of time you spend actually comparing files. See the
> man page for stat() and lstat(). What information
> returned by these calls can you use to determine
> definitively that the file is different?


stat() can also return something that can immediately tell that the
files are identical. Figuring out what it is, is left as an exercise
to the OP.

--
Måns Rullgård
mru@inprovide.com
Randy Howard

2004-11-24, 4:01 pm

In article <slrncq9ceh.4jg.stephane.chazelas@spam.is.invalid>,
this.address@is.invalid says...
> (note that to walk a directory tree, there's the standard C
> library function nftw(3))


Since when is nftw part of the C standard library?

Since the OP is taking a course on UNIX programming, he might
be able to use it, but don't mix terms like that.

--
Randy Howard (2reply remove FOOBAR)
Heny Townsend

2004-11-24, 4:01 pm

Måns Rullgård wrote:
> Nick Landsberg <SPAMhukolauTRAP@SPAMworldnetTRAP.att.net> writes:
>
>
>
>
> stat() can also return something that can immediately tell that the
> files are identical. Figuring out what it is, is left as an exercise
> to the OP.


Right, but being identical and having identical content (the homework
problem) are different things.


--
Henry Townsend
Måns Rullgård

2004-11-24, 4:01 pm

Heny Townsend <henry.townsend@not.here> writes:

> Måns Rullgård wrote:
>
> Right, but being identical and having identical content (the homework
> problem) are different things.


Being identical be necessity means having identical content, so it can
save some comparing.

--
Måns Rullgård
mru@inprovide.com
Stephane CHAZELAS

2004-11-24, 4:01 pm

2004-11-24, 16:50(+00), Randy Howard:
> In article <slrncq9ceh.4jg.stephane.chazelas@spam.is.invalid>,
> this.address@is.invalid says...
>
> Since when is nftw part of the C standard library?

[...]

Sorry, I probably should have said the standard Unix C library
as an XSI extension
(http://www.opengroup.org/onlinepubs...tions/nftw.html)

I don't know how widely it is supported, but I know it is at
least by Solaris, Linux, AIX and FreeBSD.

--
Stephane
Roger Leigh

2004-11-24, 4:01 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Heny Townsend <henry.townsend@not.here> writes:

> Måns Rullgård wrote:
>
> Right, but being identical and having identical content (the homework
> problem) are different things.


Not entirely, if you want to check for hard links:

struct stat a, b;
....
if (a.st_dev == b.st_dev && a.st_ino == b.st_ino)
printf("The two files are identical\");


- --
Roger Leigh
Printing on GNU/Linux? http://gimp-print.sourceforge.net/
Debian GNU/Linux http://www.debian.org/
GPG Public Key: 0x25BFB848. Please sign and encrypt your mail.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>

iD8DBQFBpMxrVcFcaSW/uEgRAqE/AKC/CuJhf+KV6VRq4iIyNTFH7jVzAACfY7sT
Pm6AKM6oU3b0xR38HezBe4I=
=bpGg
-----END PGP SIGNATURE-----
Nick Landsberg

2004-11-24, 4:01 pm

Roger Leigh wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>=20
> Heny Townsend <henry.townsend@not.here> writes:
>=20
>=20
>=20
>=20
> Not entirely, if you want to check for hard links:
>=20
> struct stat a, b;
> ...
> if (a.st_dev =3D=3D b.st_dev && a.st_ino =3D=3D b.st_ino)
> printf("The two files are identical\");
>=20


Not if they're on a different file system.

The question was whether or not they
are the same file (which inode number would
give if the inode was on the same file
system). A hard link just creates a
second (third, etc.) name for the same file.
I guess, by definition, a file is identical
to itself, no matter what the name.

Now, philosophically (and probably outside
the scope of the OP's homework assignment),
the question of soft links comes in.
(This may very well be debating how many
angels can dance on the head of a pin, but
I thought I'd throw it out anyway:)
Is a soft (symbolic) link to a file
identical to the original file?
I could argue either way depending on how
many beers I'd had.

NPL

--=20
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch
Roger Leigh

2004-11-24, 4:01 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Nick Landsberg <SPAMhukolauTRAP@SPAMworldnetTRAP.att.net> writes:
> Roger Leigh wrote:
>
> Not if they're on a different file system.


It will return false in that case (since st_dev will differ), but this
would only be an optimisation to save doing an actual comparison, not
a replacement for the full comparison itself.

> identical to the original file?
> I could argue either way depending on how
> many beers I'd had.


I guess this depends on the exact requirements of the OP's assignment.
(Personally, I'd add a command-line option to select between them.)


Regards,
Roger

- --
Roger Leigh
Printing on GNU/Linux? http://gimp-print.sourceforge.net/
Debian GNU/Linux http://www.debian.org/
GPG Public Key: 0x25BFB848. Please sign and encrypt your mail.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8 <http://mailcrypt.sourceforge.net/>

iD8DBQFBpNRTVcFcaSW/uEgRAs+oAKCTcl+lp9G/xPA/q8j9D8KrivFuJwCg2w4P
93e7LO6P82XGjIVXVT6VnxU=
=K632
-----END PGP SIGNATURE-----
Stefan Scholl

2004-11-24, 4:01 pm

On 2004-11-24 16:27:52, dawn wrote:

> I am now working on a C program under Unix.


Dann solltest Du aber auch in einer Unix-Newsgroup nachfragen.
Pascal Bourguignon

2004-11-24, 4:01 pm

Nick Landsberg <SPAMhukolauTRAP@SPAMworldnetTRAP.att.net> writes:
> The question was whether or not they
> are the same file (which inode number would
> give if the inode was on the same file
> system). A hard link just creates a
> second (third, etc.) name for the same file.
> I guess, by definition, a file is identical
> to itself, no matter what the name.
>
> Now, philosophically (and probably outside
> the scope of the OP's homework assignment),
> the question of soft links comes in.
> (This may very well be debating how many
> angels can dance on the head of a pin, but
> I thought I'd throw it out anyway:)
> Is a soft (symbolic) link to a file
> identical to the original file?
> I could argue either way depending on how
> many beers I'd had.


Obviously, adding an option to follow symbolic-links would motivate
additionnal points.

--
__Pascal Bourguignon__ http://www.informatimago.com/
The world will now reboot; don't bother saving your artefacts.
Måns Rullgård

2004-11-24, 4:01 pm

Nick Landsberg <SPAMhukolauTRAP@SPAMworldnetTRAP.att.net> writes:

>
> Not if they're on a different file system.


In that case, the st_dev values will be different.

> The question was whether or not they
> are the same file (which inode number would
> give if the inode was on the same file
> system). A hard link just creates a
> second (third, etc.) name for the same file.
> I guess, by definition, a file is identical
> to itself, no matter what the name.


Yes, so there can be some time saved by not comparing it to itself.

> Now, philosophically (and probably outside
> the scope of the OP's homework assignment),
> the question of soft links comes in.
> (This may very well be debating how many
> angels can dance on the head of a pin, but
> I thought I'd throw it out anyway:)
> Is a soft (symbolic) link to a file
> identical to the original file?


It depends on how you access it. Using open(), they are the same, but
using lstat() or readlink() they are not. Are there any other system
calls that do not follow links (except unlink(), of course). The
property of being identical to some other file is volatile for
symlinks, though. There are several ways, such as mount(), to change
the actual file pointed to, without ever touching a disk.

In the OP's case, he didn't specify whether symlinks were to be
followed when checking for file with identical contents. If symlinks
are not followed, should the targets of symlinks be compared? In that
case, when are relative symlinks considered equal?

--
Måns Rullgård
mru@inprovide.com
Heiko

2004-11-25, 8:59 am

dawn wrote:

> I am now working on a C program under Unix.
> The requirement for the program is that:
> A file name is passed to program as a parameter. The program will
> Find files under a specified directory. The matched file must have the
> same content with the given file. It does not matter whether the
> filenames are the same.
>
> It is easy to find file that has the same name with given file, but
> may be hard to find the files that with the same content. In my
> knowledge, i give two solutions:
>
> 1) Go throught the directory and its sub direcotry tree, and when
> meeting a file, Use the stand C library function to open that file and
> the given file, and then compare those contents in buffers to see if
> they are the same.


To optimize this, you could first try to find out if it's worth the
effort comparing the actual contents by:
- Checking if they are hard links to the same inode. If so, they'll
definately have the same contents.
- If their sizes are not equal, they definately do not have the same
contents.
- If the sizes are equal, you can start checking byte-by-byte. At the
first byte that differs, you know they are not the same.
- Use mmap() or the low-level open()/read() with large buffers instead
of the stdio functions.

I think using checksums will result in poor performance, because you'll
then be reading every byte of every file.

> 2) Go through the directory and its sub deirectroy tree, and when
> meeting a file, execute the system shell command "diff" to compare the
> two files to see if they are the same in content.


"diff" is a shell-utility. Calling this from your C-program, is a bit
like "cheating", because then comparing is not actually done by your
program.

If that is not a problem, use "cmp" instead, which is faster because it
does only tell *if* the files are different, whereas "diff" also figures
out *where* the files differ.


Heiko
Ian Pellew

2004-11-25, 3:56 pm

Hi all;

I got use to using `sdiff` as a matter of course rather than diff.
It is capable of showing the differences as well.

Regards
Ian
Martin Carpenter

2004-11-26, 4:09 pm


"Heiko" <heiko.noordhof_A_xs4all.nl> wrote:

Some good pragmatic advice in this thread. I thought I'd throw out something
more esoteric that might, say, appeal to a comp.sci prof. reading 237
different implementations of cmp(1):

> I think using checksums will result in poor performance,
> because you'll then be reading every byte of every file.


That was my first thought, too (hashing 3TB files is going to be pretty
ugly). How about "whitnesses to similarity"? Given two files n bytes long,
comparing how many (m, say) random bytes would give x% confidence that the
files are the same? (Cf. birthday theorem?).

What can we say about m for increasing n with fixed x? (No, I'm not about to
do Dawn's homework ;-)).

Of course, I'm assuming truly random read access, rather than O(n) s() to
byte i of any given file. Humour me with a vast RAM-disk?



Geoff Clare

2004-11-26, 4:09 pm

Måns Rullgård <mru@inprovide.com> wrote, on Wed, 24 Nov 2004:

>
> It depends on how you access it. Using open(), they are the same, but
> using lstat() or readlink() they are not. Are there any other system
> calls that do not follow links (except unlink(), of course).


Yes - rmdir() and rename() do not follow links. Also mkdir() and
mkfifo() don't, in the sense that if a symbolic link exists but
its target doesn't then they fail with EEXIST rather than creating
a directory/FIFO as the target of the symlink. (Contrast this with
open() using O_CREAT, which would create a plain file as the target).

One oddity is link(). POSIX requires it to follow symlinks, but
historically some versions did not. Solaris can behave either way
depending on how the program calling link() was compiled.

--
Geoff Clare <netnews@gclare.org.uk>

Måns Rullgård

2004-11-26, 4:09 pm

Geoff Clare <geoff@clare.See-My-Signature.invalid> writes:

> Måns Rullgård <mru@inprovide.com> wrote, on Wed, 24 Nov 2004:
>
>
> Yes - rmdir() and rename() do not follow links.


It would make no sense at all for those to follow links, as they do
not operate on the file named, but rather the directory containing it.

> Also mkdir() and mkfifo() don't, in the sense that if a symbolic
> link exists but its target doesn't then they fail with EEXIST rather
> than creating a directory/FIFO as the target of the symlink.
> (Contrast this with open() using O_CREAT, which would create a plain
> file as the target).


This is more likely to cause surprises. Is there any rationale for
this difference?

--
Måns Rullgård
mru@inprovide.com
Stephane CHAZELAS

2004-11-26, 4:09 pm

2004-11-26, 14:52(+00), Geoff Clare:
[...] [about symlkinks as arguments to syscalls]
> (Contrast this with open() using O_CREAT, which would create a
> plain file as the target).


Not if there's a O_EXCL flag, though.

--
Stephane

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com