| Author |
Detecting network file system
|
|
| Henrik Goldman 2007-11-21, 8:09 am |
| Is there any simple way to find out from C/C++ whether a path, e.g. HOME,
is located on a local or a network filesystem?
This would be relevant on most Unix systems... but so far I haven't seen
any simple API for it.
Thanks.
-- Henrik
| |
|
| On Wed, 21 Nov 2007 11:37:09 +0100, Henrik Goldman wrote:
> Is it possible in any simple way to find out from C/C++ if a path e.g.
> HOME is located on a local or network filesystem?
>
> This is something that would be relevant for most unix systems... but so
> far I haven't seen any simple API.
>
For networked files, the st_dev member of struct stat will contain
"faked" values. I don't know of a standard method of checking the
validity of {major,minor}, though. Maybe just a mknod probe, testing the
resulting error/errno, could give you sufficient information.
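
For reference, reading the device numbers of a path could look roughly like the sketch below. It assumes Linux with glibc, where major() and minor() come from <sys/sysmacros.h>; other systems may declare them elsewhere. The helper name path_device is made up for illustration. Deciding whether a given {major,minor} pair is "faked" is system-specific and is not attempted here. On Linux a major number of 0 commonly indicates a filesystem without a backing block device, but that also covers non-network filesystems such as tmpfs.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>  /* major(), minor(): assumed location (Linux/glibc) */
#include <sys/types.h>

/* Hypothetical helper: fetch the device numbers of the filesystem that
 * holds path. Returns 0 on success, -1 if stat() fails (errno is set). */
int path_device(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    *maj = major(st.st_dev);
    *min = minor(st.st_dev);
    return 0;
}
```

A caller would pass something like getenv("HOME") and print the two numbers, then compare them against what the system in question documents for its network mounts.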
HTH,
AvK
| |
| Eric Sosman 2007-11-21, 8:09 am |
| Henrik Goldman wrote:
> Is it possible in any simple way to find out from C/C++ if a path e.g. HOME
> is located on a local or network filesystem?
>
> This is something that would be relevant for most unix systems... but so far
> I haven't seen any simple API.
Can you give an exact definition of "local" and "network"
file system? What with fibre channel and iSCSI and SANs and
multi-pathed switched storage fabrics and pooled storage in
blade servers, the distinction seems to get rather blurry ...
I guess the real question is: Why do you care? What do
you intend to do differently depending on the answer? Maybe
there's a more direct way to get at the operational difference
that you care about than by asking what seems like a surrogate
question.
--
Eric Sosman
esosman@ieee-dot-org.invalid
| |
| Henrik Goldman 2007-11-21, 7:10 pm |
|
> Can you give an exact definition of "local" and "network"
> file system? What with fibre channel and iSCSI and SANs and
> multi-pathed switched storage fabrics and pooled storage in
> blade servers, the distinction seems to get rather blurry ...
I would rather say local and remote filesystem then.
> I guess the real question is: Why do you care? What do
> you intend to do differently depending on the answer? Maybe
> there's a more direct way to get at the operational difference
> that you care about than by asking what seems like a surrogate
> question.
>
I spent 3 days on a bug with a customer who said an application kept
crashing.
However, I simply could not reproduce it. After A LOT of testing it turned
out that they were running multiple copies of the same application. This
application was saving a local cache with some information on every
execution. With 5-10 copies running in parallel, the software worked
perfectly fine in our local setting but failed at theirs.
The file-locking mechanism provided by the OS (Linux 2.4 kernel) didn't work
very well, so some of these application copies could fail and the cache could
be corrupted.
I never managed to reproduce the problem, but after I added a verification
check so that the file is only saved when there are actual changes (and not
every time, as it was until then), the customer has had no more problems.
In the end I think it boils down to ensuring the file is locked before
overwriting it and ensuring that other instances of the application won't
get a corrupted copy.
The customer actually reproduced the problem using a simple
fopen/fgets/fprintf application and said that it failed more often on their
network filesystem than on their local machine.
-- Henrik
| |
| Eric Sosman 2007-11-21, 7:10 pm |
| Henrik Goldman wrote On 11/21/07 15:22,:
> I would rather say local and remote filesystem then.
>
> I spent 3 days on a bug with a customer who said an application kept
> crashing.
> However, I simply could not reproduce it. After A LOT of testing it turned
> out that they were running multiple copies of the same application. This
> application was saving a local cache with some information on every
> execution. With 5-10 copies running in parallel, the software worked
> perfectly fine in our local setting but failed at theirs.
> The file-locking mechanism provided by the OS (Linux 2.4 kernel) didn't work
> very well, so some of these application copies could fail and the cache could
> be corrupted.
>
> I never managed to reproduce the problem, but after I added a verification
> check so that the file is only saved when there are actual changes (and not
> every time, as it was until then), the customer has had no more problems.
>
> In the end I think it boils down to ensuring the file is locked before
> overwriting it and ensuring that other instances of the application won't
> get a corrupted copy.
> The customer actually reproduced the problem using a simple
> fopen/fgets/fprintf application and said that it failed more often on their
> network filesystem than on their local machine.
Okay, so what you'd really like to know isn't local vs.
remote, but "Does the *&^%$#@! locking work right?" I'm
definitely not a specialist in the different flavors of
file systems, but my impression (from reading this group,
mostly) is that different file systems implement different
semantics for locks even if the API looks the same -- and,
of course, even file systems are not bug-free ...
It sounds like you'd need something considerably more
involved than a local vs. remote determination, like a list
of which combinations of file system, implementation, and
protocol do and don't work. It could be a nightmare to keep
up-to-date, too: "Everything was fine with Version 6.3, but
broke when I upgraded to 7.0 ..."
Maybe the acid test could be something like the simple
program your customer used. At installation time or some
similar convenient moment, you'd display "Scrutinizing the
omens; please wait" while running the test, and if it failed
you'd say "This file system has bad feng shui; please choose
another." (This presumes that you can get the test to "fail
reliably" and in a reasonable amount of time.)
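
A probe along those lines might look like the following sketch. It is not the customer's test program (which the thread does not show); it assumes the locks in question are fcntl() record locks, and the name probe_fcntl_locking is made up for illustration. One process locks the file, a second process asks for the same lock without blocking, and the probe reports whether the second request was refused. It only exercises two processes on one host, so it says nothing certain about clients on different machines.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical probe. Returns 1 if a second process is refused a lock the
 * first one holds, 0 if the second process got the lock anyway, and -1 if
 * the probe itself could not be run. */
int probe_fcntl_locking(const char *path)
{
    struct flock fl;
    int fd, status;
    pid_t pid;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;        /* exclusive lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */
    if (fcntl(fd, F_SETLK, &fl) != 0) {
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: record locks are not inherited across fork(), so this
         * request conflicts with the parent's lock and should be refused. */
        int cfd = open(path, O_RDWR);
        if (cfd < 0)
            _exit(2);
        if (fcntl(cfd, F_SETLK, &fl) == 0)
            _exit(1);           /* lock granted twice: not enforced */
        _exit((errno == EACCES || errno == EAGAIN) ? 0 : 2);
    }

    if (waitpid(pid, &status, 0) < 0) {
        close(fd);
        return -1;
    }
    close(fd);                  /* releases the parent's lock */
    if (!WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

The probe file is left in place afterwards; an installer would presumably create it in the directory that will hold the cache and remove it once the check is done.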
The only other thing I can suggest is to jettison the
ineffective locking APIs and roll your own substitutes
using cruder methods. The file systems I've seen (again, I
make no claim to specialist knowledge) have been pretty good
at preserving atomicity for the creation and deletion of files,
so you could rely on a HANDS_OFF file and the O_EXCL flag.
It's not suited for rapid micro-updates but may suffice if
the cache writes come in relatively infrequent batches. It's
also a pain if a process creates HANDS_OFF and then crashes
before removing it -- but maybe it'll be easier to recover
from "everybody stopped" than from "everybody's data turned
to oatmeal."
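
The HANDS_OFF idea can be sketched as below. The function names are hypothetical, and recovery from a stale lock file left by a crashed process is deliberately not handled. One caveat worth checking for the system at hand: the Linux open(2) manual page notes that O_EXCL over NFS is only supported with NFSv3 or later on kernel 2.6 or later, so on older setups this scheme may itself be racy.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to take the lock by creating the lock file exclusively.
 * Returns 0 if the caller now owns the lock, 1 if the file already
 * exists (someone else holds the lock), -1 on any other error. */
int hands_off_acquire(const char *lockpath)
{
    int fd = open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd >= 0) {
        close(fd);              /* the file's existence is the lock */
        return 0;
    }
    return (errno == EEXIST) ? 1 : -1;
}

/* Release the lock by removing the lock file.
 * Returns 0 on success, -1 on error (errno is set by unlink). */
int hands_off_release(const char *lockpath)
{
    return unlink(lockpath);
}
```

A writer would call hands_off_acquire() in a loop with a short sleep while it returns 1, update the cache, and then call hands_off_release().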
Sounds like a nasty problem. Good luck with it!
--
Eric.Sosman@sun.com
| |
| Rainer Weikusat 2007-11-22, 8:08 am |
| "Henrik Goldman" <henrik_goldman@mail.tele.dk> writes:
[...]
> However, I simply could not reproduce it. After A LOT of testing it turned
> out that they were running multiple copies of the same application. This
> application was saving a local cache with some information on every
> execution. With 5-10 copies running in parallel, the software worked
> perfectly fine in our local setting but failed at theirs.
> The file-locking mechanism provided by the OS (Linux 2.4 kernel) didn't work
> very well, so some of these application copies could fail and the cache could
> be corrupted.
What makes you believe that this is a defect in the kernel's locking
code and not misuse of the kernel's locking mechanism? And which
mechanism were you using?
[...]
> The customer actually reproduced the problem using a simple
> fopen/fgets/fprintf application and said that it failed more often on their
> network filesystem versus their local machine.
The customer managed to write a buggy application which failed? And
what precisely is this supposed to demonstrate? 'Failing more often
when used over the network' could hint at a race condition in the
code.
| |
| Rainer Weikusat 2007-11-22, 8:08 am |
| Eric Sosman <Eric.Sosman@sun.com> writes:
> Henrik Goldman wrote On 11/21/07 15:22,:
[...]
[...]
> Okay, so what you'd really like to know isn't local vs.
> remote, but "Does the *&^%$#@! locking work right?"
I dare say it does.
> I'm definitely not a specialist in the different flavors of
> file systems, but my impression (from reading this group,
> mostly) is that different file systems implement different
> semantics for locks even if the API looks the same -- and,
> of course, even file systems are not bug-free ...
The semantics for POSIX/ UNIX(*) record locks are defined by the
UNIX(*)-standard. Additionally, these locks are kernel objects and
usually involve no filesystem specific code. Network-mounted file
systems are the obvious exception to this, because the server
exporting the filesystem needs to be consulted to determine the fate
of a locking request.
There is a lot of anecdotal evidence floating around the net that
the locking support of some antiquated NFS implementations 'was
buggy'. To date, I have not heard anything specific on this, so it is
entirely conceivable that these rumours either actually resulted from
misusing the locking mechanisms, too, or that the affected
implementations stopped being used by anyone except historians ten
to fifteen years ago (anyone having details on this?).
[...]
> The only other thing I can suggest is to jettison the
> ineffective locking APIs and roll your own substitutes
> using cruder methods.
Another thing which is entirely conceivable is that there never was
any problem with this, except that the functionality wasn't already
part of 7th edition UNIX(*), and people who did not miss it back then
try to make it disappear again by badmouthing it, cf. 'Why you should
never use threads'.
As I wrote, I would greatly appreciate any actual information on this
mythical issue, and I am really tired of chasing left-over lock files
or debugging processes which don't start because the PID recorded by
the previous incarnation of the process, which left the stale lock
file lying around, meanwhile belongs to a completely different
process, etc.
| |
| Juha Laiho 2007-11-22, 7:09 pm |
| Rainer Weikusat <rweikusat@mssgmbh.com> said:
>Eric Sosman <Eric.Sosman@sun.com> writes:
>[...]
>I dare say it does.
Most of the time, yes, at least for certain combinations of
operating systems. Is that enough conditionals for a single
sentence?
>There is a lot of anecdotal evidence floating around the net that
>the locking support of some antiquated NFS implementations 'was
>buggy'. To date, I have not heard anything specific on this, so it is
>entirely conceivable that these rumours either actually resulted from
>misusing the locking mechanisms, too, or that the affected
>implementations stopped being used by anyone except historians ten
>to fifteen years ago (anyone having details on this?).
I remember I did patch (or force at compile time) several mail client
implementations to use locking other than flock() to lock mail spool
files. But then, this was in the timeframe you indicate - a decade ago,
and on HP-UX, which I gather wasn't a superb NFS implementation at the
time.
As for more recent sightings, the Linux NFS FAQ seems to contain a nice
discussion of the topic: http://nfs.sourceforge.net/#faq_d10
... so, it looks like flock() traditionally has not supported NFS on any
platform, but slowly the kernel code of various OSes is starting to
use other locking mechanisms to provide flock() also for NFS-mounted
file systems. However, based on the NFS FAQ article, it looks like
this could even cause bugs in some multi-OS environments where
different OSes use different underlying locking techniques.
--
Wolf a.k.a. Juha Laiho Espoo, Finland
(GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
"...cancel my subscription to the resurrection!" (Jim Morrison)
| |
| Rainer Weikusat 2007-11-22, 7:09 pm |
| Juha Laiho <Juha.Laiho@iki.fi> writes:
> Rainer Weikusat <rweikusat@mssgmbh.com> said:
>
> Most of the time, yes, at least for certain combinations of
> operating systems. Is that enough conditionals for a single
> sentence?
It is a soap bubble. Everything which 'works' can only work most of
the time. The same is true for 'certain combinations'. Each
combination which works is a certain combination.
>
> I remember I did patch (or force at compile time) several mail client
> implementations to use locking other than flock() to lock mail spool
> files. But then, this was in the timeframe you indicate - a decade ago,
> and on HP-UX, which I gather wasn't a superb NFS implementation at the
> time.
>
> As for more recent sightings, the Linux NFS FAQ seems to contain a nice
> discussion of the topic: http://nfs.sourceforge.net/#faq_d10
*shrug* ... flock is the 4.2BSD locking primitive. It is not
standardized anywhere and consequently works differently on
different systems. E.g. SunOS 4.1.3 documents it as 'locally visible
only', SunOS 5 documents it as 'use with any of the system libraries
or in multi-threaded applications is unsupported', the Linux flock
has changed a couple of times, and HP-UX (11) does not have it at all.
The sentence "The semantics for POSIX/ UNIX(*) record locks are
defined by the UNIX(*)-standard." was intended to communicate that I
was writing about those record locks and not about legacy
interfaces coming from Berkeley UNIX(*) which never gained enough
traction to be consistently available outside of Berkeley UNIX(*).
| |
| fjblurt@yahoo.com 2007-11-22, 7:09 pm |
| On Nov 22, 10:38 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
> *shrug* ... flock is the 4.2BSD locking primitive. It is not
> standardized anywhere and consequently works differently on
> different systems. E.g. SunOS 4.1.3 documents it as 'locally visible
> only', SunOS 5 documents it as 'use with any of the system libraries
> or in multi-threaded applications is unsupported', the Linux flock
> has changed a couple of times, and HP-UX (11) does not have it at all.
>
> The sentence "The semantics for POSIX/ UNIX(*) record locks are
> defined by the UNIX(*)-standard." was intended to communicate that I
> was writing about those record locks and not about legacy
> interfaces coming from Berkeley UNIX(*) which never gained enough
> traction to be consistently available outside of Berkeley UNIX(*).
To be clear, then, it's the fcntl(2) locking API you're talking about?
I agree that, in this day and age, the original poster should suspect
a bug in his code, or a system configuration error, before concluding
that file locking on network file systems is impossible. He should
make sure that he's using a file locking method that's documented as
supported for network file systems (fcntl and lockf usually are, flock
sometimes is not). If the file system in question is NFS, make sure
lockd is running.
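
For illustration, an fcntl()-based update of a shared cache file could be sketched as follows. This is not the original poster's code (the thread never shows it); write_cache_locked is a hypothetical name, and the sketch assumes that every process touching the file takes the lock, since these locks are advisory. Readers would take F_RDLCK in the same way before reading.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: replace the contents of path with buf while holding
 * an exclusive fcntl() lock on the whole file.
 * Returns 0 on success, -1 on error. */
int write_cache_locked(const char *path, const char *buf, size_t len)
{
    struct flock fl;
    int fd, rc = -1;

    /* No O_TRUNC here: truncating before the lock is held would let
     * another process read an empty or half-written file. */
    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;        /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */

    /* F_SETLKW blocks until the lock is granted; it can fail with EINTR
     * if a signal arrives, which this sketch treats as an error. */
    if (fcntl(fd, F_SETLKW, &fl) == 0) {
        if (ftruncate(fd, 0) == 0 &&
            write(fd, buf, len) == (ssize_t)len)
            rc = 0;             /* a short write is treated as failure */
    }

    /* close() drops every fcntl lock this process holds on the file. */
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Because closing any descriptor for the file releases the process's locks on it, code like this should not open and close the same file elsewhere while the lock is meant to be held.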
It might be educational to run procmail's configure script, which
tests several file locking methods on your filesystems.
| |
| David Schwartz 2007-11-23, 7:09 pm |
| On Nov 22, 2:56 am, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
> What makes you believe that this would be a defect of the locking code
> in the kernel and not misuse of the locking mechanism in the kernel?
> And which mechanism were you using?
I would expect that the POSIX-standardized locking functions should
work quite properly on any filesystem that isn't more than 8 years
old.
> The customer managed to write a buggy application which failed? And
> what precisely is this supposed to demonstrate? 'Failing more often
> when used over the network' could hint at a race condition in the
> code.
Exactly. I can easily write code that fails more often on a dual-core
machine than a single-core machine. All that proves is that I can
write buggy code. It doesn't mean there's anything wrong with dual-
core machines or that they should be avoided. Buggy code should be
avoided.
Now if he had a program that always worked properly on a local machine
and sometimes failed on a network filesystem, that might indicate a
problem in the network filesystem. As is, he's just demonstrated a
property of his program's bugs (and possibly the OP's program's bugs
too).
DS