lock file problems
Brett

2004-12-21, 3:59 pm

I am working with a Monte Carlo simulation that outputs each run
consecutively to a file and then writes out the statistical summary at
the end. One line is a summary of the summary. This one line is also
written to a summary output file. The file I/O is set up so as to
create a lock file when this second file is opened and written to so as
to prevent file collision. In general, it appears to work.

Further, for a task, many different points in a grid are simulated, so
we use Sun's gridware to have a cluster of machines (about 50) to run
50+ simulations simultaneously. Each grid is about 2000 points. Each
machine is different - some are multiprocessors, some aren't. The
clock speeds of the computers are all different.

The problem is that while I might get 2000 output files, I may only get
XX summary output lines in the summary output file, where XX < 2000.
How many lines are missing seems to be random. As near as I can figure
out, multiple simulations try to write to the same file at the same
exact time, which causes problems. I believe I lose lines when two
executables try to create the lockfile at the same time. One
executable actually gets it, while the other thinks it gets it, but
doesn't. Sometimes I get an error, sometimes I don't.

If I get an error, it is typically:

open: can't stat file
apparent state: unit 36 named summary.lck
lately writing sequential formatted external IO

which leads me to believe that some form of file collision is taking
place, as the lockfile code has a different error message.

Since the summary line appears in both the output and summary files, and
the output file names are all unique, I have been
post-processing the output files and ignoring the summary file.
However, my latest task generated 2 million+ output files and grep'ing
and sed'ing that many files is starting to take time. It would be nice
if I could just get the summary output file to work correctly, but I am
unsure as to how to do that. I have only been doing this for a few
months, so I am learning Fortran as I go. The simulation was written
ages ago (>25 years) in f77.

I would think that due to the nature of gridware, the start and stop
times of each simulation are random, although I haven't tested this.
It's hard for me to believe that two executables would try to set the
lockfile at the same exact time, as the hardware on many of the machines
is different. It seems to me that everything is random enough that
this really wouldn't be an issue, but it is.

I am not sure that this is the proper newsgroup to ask a question like
this, but I thought that I would start here.

Thanks,
Brett

Janne Blomqvist

2004-12-21, 3:59 pm

In article <1103640978.137982.303150@c13g2000cwb.googlegroups.com>, Brett wrote:
> I am working with a Monte Carlo simulation that outputs each run
> consecutively to a file and then writes out the statistical summary at
> the end. One line is a summary of the summary. This one line is also
> written to a summary output file. The file I/O is set up so as to
> create a lock file when this second file is opened and written to so as
> to prevent file collision. In general, it appears to work.


Use of lockfile, ok..

> Further, for a task, many different points in a grid are simulated, so
> we use Sun's gridware to have a cluster of machines (about 50) to run
> 50+ simulations simultaneously. Each grid is about 2000 points. Each
> machine is different - some are multiprocessors, some aren't. The
> clock speeds of the computers are all different.


I assume you're using NFS?

> The problem is that while I might get 2000 output files, I may only get
> XX summary output lines in the summary output file, where XX < 2000.
> How many lines are missing seems to be random. As near as I can figure
> out, multiple simulations try to write to the same file at the same
> exact time, which causes problems. I believe I lose lines when two
> executables try to create the lockfile at the same time. One
> executable actually gets it, while the other thinks it gets it, but
> doesn't. Sometimes I get an error, sometimes I don't.
>
> If I get an error, it is typically:
>
> open: can't stat file
> apparent state: unit 36 named summary.lck
> lately writing sequential formatted external IO
>
> which leads me to believe that some form of file collision is taking
> place, as the lockfile code has a different error message.


Ah, well. Locking and NFS is a nightmare. AFAIK, there is no portable
(as in standard Fortran) way to make it work correctly.

Anyways, as you have noticed, what you're doing is not safe. There are
two ways to make it work:

1) Newer NFS servers often support file locking, i.e. instead of using
a lock file you can lock the file directly. This functionality can be
accessed with the POSIX system call fcntl.

2) Use a lockfile. Unfortunately, the straightforward approach doesn't
work, as IIRC (among other things?) exclusive file creation over NFS is
not atomic. Paste from the Linux open(2) manpage:

O_EXCL When used with O_CREAT, if the file already exists
it is an error and the open will fail. In this context,
a symbolic link exists, regardless of where its points
to. O_EXCL is broken on NFS file systems, programs
which rely on it for performing locking tasks will
contain a race condition. The solution for performing
atomic file locking using a lockfile is to create a
unique file on the same fs (e.g., incorporating
hostname and pid), use link(2) to make a link to the
lockfile. If link() returns 0, the lock is successful.
Otherwise, use stat(2) on the unique file to check if
its link count has increased to 2, in which case the
lock is also successful.

So, you need the fcntl or alternatively the link/stat system calls as
well as some functionality to generate unique file names. I don't know
if your Fortran processor provides that via some extension. If not, I
guess you can write the functions in C and link that to your Fortran
code.
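
A minimal C sketch of the link/stat recipe from the manpage paste above,
assuming the lockfile sits in a directory where hard links work. The
names nfs_lock_try and nfs_lock_release are made up for illustration and
are not part of any library; calling them from Fortran would still need
a compiler-specific wrapper for the character argument.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose gethostname() and link() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Try once to take the lock (hypothetical helper).
   Returns 0 if we now hold the lock, -1 if someone else holds it
   or an error occurred. */
int nfs_lock_try(const char *lockfile)
{
    char host[256];
    char unique[1024];
    struct stat st;
    int fd, n, got;

    assert(lockfile != NULL);

    /* Build a name no other process should use: <lockfile>.<host>.<pid> */
    if (gethostname(host, sizeof host) != 0)
        return -1;
    host[sizeof host - 1] = '\0';
    n = snprintf(unique, sizeof unique, "%s.%s.%ld",
                 lockfile, host, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof unique)
        return -1;

    /* Create the unique file on the same file system as the lockfile. */
    fd = open(unique, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* link() fails if the lockfile already exists. */
    got = (link(unique, lockfile) == 0);
    if (!got) {
        /* The reply may have been lost; a link count of 2 on the
           unique file means the link was made after all. */
        if (stat(unique, &st) == 0 && st.st_nlink == 2)
            got = 1;
    }

    unlink(unique);               /* the lockfile itself stays behind */
    return got ? 0 : -1;
}

/* Give the lock back by removing the lockfile. Returns 0 on success. */
int nfs_lock_release(const char *lockfile)
{
    assert(lockfile != NULL);
    return unlink(lockfile);
}
```

The caller would retry nfs_lock_try (sleeping a little between attempts)
until it returns 0, append its summary line, close the summary file and
only then call nfs_lock_release.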

> I am not sure that this is the proper newsgroup to ask a question like
> this, but I thought that I would start here.


If this doesn't help you might want to try comp.protocols.nfs or
something like that.


--
Janne Blomqvist

Dave Thompson

2004-12-27, 3:55 am

On 21 Dec 2004 06:56:18 -0800, "Brett" <NoSpam@grantb.org> wrote:

> I am working with a Monte Carlo simulation that outputs each run
> consecutively to a file and then writes out the statistical summary at
> the end. One line is a summary of the summary. This one line is also
> written to a summary output file. The file I/O is set up so as to
> create a lock file when this second file is opened and written to so as
> to prevent file collision. In general, it appears to work.
>
> Further, for a task, many different points in a grid are simulated, so
> we use Sun's gridware to have a cluster of machines (about 50) to run
> 50+ simulations simultaneously. Each grid is about 2000 points. <snip>
> The problem is that while I might get 2000 output files, I may only get
> XX summary output lines in the summary output file, where XX < 2000.
> How many lines are missing seems to be random. As near as I can figure
> out, multiple simulations try to write to the same file at the same
> exact time, which causes problems. <snip>
> Since the summary line appears in both the output and summary files, and
> the output file names are all unique, I have been currently
> post-processing the output files and ignoring the summary file.
> However, my latest task generated 2 million+ output files and grep'ing
> and sed'ing that many files is starting to take time. <snip>


You might try reducing the contention by having, say, 100 or so
groups of runs which write to separate summary files; especially, if
the numbers are convenient (as appears), one group per machine, since
keeping the writers of a file on one machine probably improves the
chance of the lockfile working right. And even if it doesn't, when a
loss does occur you only need to grep+sed 1/100 or 2/100 or whatever
of your output files.
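
One way to do the grouping, sketched in C under the assumption that one
summary file per host is acceptable. The function names and the
"summary.<group>.out" naming pattern are invented for illustration; the
real summary file name is whatever the simulation already uses.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose gethostname() */
#endif
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Write "summary.<group>.out" into buf (hypothetical naming scheme).
   Returns 0 on success, -1 if the name does not fit. */
int summary_name(char *buf, size_t buflen, const char *group)
{
    int n;

    assert(buf != NULL && group != NULL);
    n = snprintf(buf, buflen, "summary.%s.out", group);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Use the host name as the group, so that all processes on one
   machine share a summary file and no two machines do. */
int summary_name_for_host(char *buf, size_t buflen)
{
    char host[256];

    if (gethostname(host, sizeof host) != 0)
        return -1;
    host[sizeof host - 1] = '\0';
    return summary_name(buf, buflen, host);
}
```

The per-host files can be concatenated after the grid finishes, and a
short file points directly at the hosts whose runs need re-checking.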

Another idea: if the summary you are extracting is (always) at the end
of a largish file, you could tail -1 rather than grep /searchstring/.
That doesn't have to read through the whole file looking for the data.
It does still have to open (and close) each file so if that's the
bottleneck (especially over NFS?) it doesn't help. And standard tail,
including apparently Sun's, requires one invocation per file; if you
can use/get the GNU one, it can do multiple files per invocation.
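
If neither tail is convenient, the same trick is easy to sketch in C:
seek to near the end of the file and pick out the last line from that
chunk. This is only an illustration; last_line and TAIL_CHUNK are
made-up names, and it assumes the summary line is shorter than
TAIL_CHUNK bytes. Unlike tail -1, it skips trailing newline characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TAIL_CHUNK 4096   /* how many bytes to read from the end */

/* Copy the last line of `path`, without its newline, into buf.
   Returns 0 on success, -1 on error or if the line does not fit. */
int last_line(const char *path, char *buf, size_t buflen)
{
    char chunk[TAIL_CHUNK];
    FILE *fp;
    long size, start;
    size_t n, begin, end, len;

    assert(path != NULL && buf != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    if (fseek(fp, 0L, SEEK_END) != 0 || (size = ftell(fp)) < 0) {
        fclose(fp);
        return -1;
    }
    /* Only the tail of the file is read, not the whole thing. */
    start = size > TAIL_CHUNK ? size - TAIL_CHUNK : 0;
    if (fseek(fp, start, SEEK_SET) != 0) {
        fclose(fp);
        return -1;
    }
    n = fread(chunk, 1, sizeof chunk, fp);
    fclose(fp);

    /* Drop trailing newline characters. */
    end = n;
    while (end > 0 && (chunk[end - 1] == '\n' || chunk[end - 1] == '\r'))
        end--;
    /* Walk back to the start of the last line. */
    begin = end;
    while (begin > 0 && chunk[begin - 1] != '\n')
        begin--;

    len = end - begin;
    if (len == 0 || len >= buflen)
        return -1;
    memcpy(buf, chunk + begin, len);
    buf[len] = '\0';
    return 0;
}
```

A small driver could call this once per output file name and print the
results, which avoids starting a separate process for each file.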

- David.Thompson1 at worldnet.att.net