Debugging stochastic programs

AN O'Nymous

2006-01-23, 7:57 am

Hello. I have a stochastic program that crashes with a SIGSEGV (in
Linux) after running for about a day. The exact part of the
optimisation process the program crashes in is variable (surprising,
considering that all RNG calls are actually deterministic).

Any ideas on what might be causing it, or how I can find the bug?

I have not been able to reproduce the error with shorter optimisation
runs.

Thanks.

Dr Ivan D. Reid

2006-01-23, 7:57 am

On 23 Jan 2006 02:39:21 -0800, AN O'Nymous <a_n_onymous80@yahoo.co.uk>
wrote in <1138012761.297101.280680@g49g2000cwa.googlegroups.com>:
> Hello. I have a stochastic program that crashes with a SIGSEGV (in
> Linux) after running for about a day. The exact part of the
> optimisation process the program crashes in is variable (surprising,
> considering that all RNG calls are actually deterministic).


But are you starting with the same seed each time?

> Any ideas on what might be causing it, or how I can find the bug?


Turn on bounds checking, and look very carefully for mismatched
sub-program parameters (esp. if sections are compiled separately).

> I have not been able to reproduce the error with shorter optimisation
> runs.


Are you overflowing an accumulated sum (tho' that's not likely to
directly cause a sigsegv, I think)?

--
Ivan Reid, Electronic & Computer Engineering, ___ CMS Collaboration,
Brunel University. Ivan.Reid@[brunel.ac.uk|cern.ch] Room 40-1-B12, CERN
KotPT -- "for stupidity above and beyond the call of duty".

Arjen Markus

2006-01-23, 7:57 am

Another possibility: uninitialised variables. Their contents will be
random (not pseudorandom) and are likely to vary with each run.

Regards,

Arjen

AN O'Nymous

2006-01-23, 7:57 am


Dr Ivan D. Reid wrote:

> But are you starting with the same seed each time?


Yes. I will try leaving two separate runs going with exactly the same
input & database files. The database files were slightly different for
the two runs where the SIGSEGV occurred, and I suppose that if the number
of random calls was off by just 1, the seeds would differ from then on.
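
A minimal sketch of pinning the seed so that two runs draw the same
sequence. It assumes the standard intrinsics random_seed and
random_number rather than the rng(random,seed) routine used in this
program, and the seed values themselves are arbitrary illustrations.

```fortran
program fixed_seed
  implicit none
  integer :: n, i
  integer, allocatable :: seed(:)
  real :: u(3), v(3)

  ! Ask the processor how many integers make up a seed.
  call random_seed(size=n)
  allocate(seed(n))

  ! Arbitrary fixed values (an assumption for illustration only).
  seed = [(12345 + 37*i, i = 1, n)]

  ! Draw three numbers after installing the seed.
  call random_seed(put=seed)
  call random_number(u)

  ! Reinstall the same seed and draw again.
  call random_seed(put=seed)
  call random_number(v)

  ! The two draws should be identical if seeding is reproducible.
  print *, 'sequences identical: ', all(u == v)
end program fixed_seed
```

If the two sequences match here but full runs still diverge, the
difference is coming from somewhere other than the generator, such as
the input files or corrupted memory.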


> Turn on bounds checking, and look very carefully for mismatched
> sub-program parameters (esp. if sections are compiled separately).


It is one program compiled as a whole, using Intel FC 8.0 for Linux.
How do I turn on bounds checking?


> Are you overflowing an accumulated sum (tho' that's not likely to
> directly cause a sigsegv, I think)?


I don't think so. I haven't noticed this bug with other runs of similar
size. My guess is that one specific randomly generated number is
triggering a problem somewhere in the code.

I don't have such a line in my code but an example of what I'm saying
would be:
Call rng(random,seed)

temp = 1.0/random

The above code would be fine for all values except for a particular
random value of 0.

How do people normally debug a stochastic program that only crashes
after 1 day and is otherwise fine in 99.9% of the other runs?

Michael Metcalf

2006-01-23, 7:57 am


"AN O'Nymous" <a_n_onymous80@yahoo.co.uk> wrote in message
news:1138020013.152560.292130@g43g2000cwa.googlegroups.com...
>
> I don't have such a line in my code but an example of what I'm saying
> would be:
> Call rng(random,seed)
>
> temp = 1.0/random
>
> The above code would be fine for all values except for a particular
> random value of 0.
>

I've known of a program that crashed when an RNG returned 1.0, although,
like random_number, it was supposed to return only values x with
0 <= x < 1.0. Which RNG are you using?

Regards,

Mike Metcalf


Patrick Begou

2006-01-23, 7:04 pm

AN O'Nymous wrote:
> Hello. I have a stochastic program that crashes with a SIGSEGV (in
> Linux) after running for about a day. The exact part of the
> optimisation process the program crashes in is variable (surprising,
> considering that all RNG calls are actually deterministic).
>
> Any ideas on what might be causing it, or how I can find the bug?
>
> I have not been able to reproduce the error with shorter optimisation
> runs.
>
> Thanks.
>


What about compiling with the -g option and running the application?
It will run slower, but it will generate a core file when the program
crashes (provided core dumps are enabled in your shell, e.g. with
ulimit -c unlimited). With the debugger, you can then get to the
subroutine/function line where the SIGSEGV occurs and investigate.

In the directory where the core file is (it may be called core or
core.<pid>):
# gdb path/to/the/executable core
(gdb) help where
Print backtrace of all stack frames, or innermost COUNT frames.
With a negative argument, print outermost -COUNT frames.
Use of the 'full' qualifier also prints the values of the local variables.
(gdb) where

Dr Ivan D. Reid

2006-01-23, 7:04 pm

On 23 Jan 2006 04:40:13 -0800, AN O'Nymous <a_n_onymous80@yahoo.co.uk>
wrote in <1138020013.152560.292130@g43g2000cwa.googlegroups.com>:

> Dr Ivan D. Reid wrote:


> It is one program compiled as a whole, using Intel FC 8.0 for Linux.
> How do I turn on bounds checking?


I looked at the on-line documentation for V9.0; probably -CB or
-check bounds in your compiler flags. If you are using arrays heavily, this
can significantly increase run times, unfortunately.

> I don't have such a line in my code but an example of what I'm saying
> would be:
> Call rng(random,seed)


> temp = 1.0/random


> The above code would be fine for all values except for a particular
> random value of 0.


Yes, of course your documentation should note whether the RNG
returns [0,1) (the usual) or (0,1] or [0,1] (less usual). I had to change
my logic once in a MC for just this reason, tho' it was log(0.0) rather than
dividing by zero.
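
A minimal sketch of one common way around the log(0.0) case, assuming a
generator that returns values in [0,1) as the standard random_number
intrinsic does. The program name and the exponential-deviate use case
are illustrative, not taken from the code under discussion.

```fortran
program safe_log_draw
  implicit none
  real :: u, x
  integer :: i

  do i = 1, 5
     ! The intrinsic returns 0.0 <= u < 1.0, so u itself may be zero.
     call random_number(u)

     ! 1.0 - u lies in (0,1], so the logarithm is always defined.
     x = -log(1.0 - u)

     print *, x
  end do
end program safe_log_draw
```

If the generator in use returns (0,1] or [0,1] instead, the mapping has
to be adjusted accordingly, which is why the documented range matters.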

If you think it's some particular value and you can isolate it to
one particular routine, you could try exercising that routine with all
possible random numbers in sequence (typically (0 to 2^24-1)/2^24 for single
precision).
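
A rough sketch of such an exhaustive scan. The function suspect is a
hypothetical stand-in for the routine under suspicion, and the sketch
assumes that default real is IEEE single precision and that
floating-point exceptions are not set to trap.

```fortran
program scan_all_uniforms
  use, intrinsic :: ieee_arithmetic, only: ieee_is_finite
  implicit none
  integer, parameter :: nvals = 2**24
  integer :: i, nbad
  real :: u, y

  nbad = 0
  do i = 0, nvals - 1
     ! Every i below 2**24 is exactly representable, so u is exact.
     u = real(i) / real(nvals)
     y = suspect(u)

     ! Record any input that produces an infinity or a NaN.
     if (.not. ieee_is_finite(y)) then
        nbad = nbad + 1
        if (nbad <= 10) print *, 'non-finite result for u =', u
     end if
  end do
  print *, 'inputs giving a non-finite result:', nbad

contains

  ! Hypothetical stand-in for the routine being exercised.
  function suspect(u) result(y)
    real, intent(in) :: u
    real :: y
    y = -log(u)
  end function suspect

end program scan_all_uniforms
```

With trapping enabled instead, the run would simply stop at the first
offending value, which identifies it just as well.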

> How do people normally debug a stochastic program that only crashes
> after 1 day and is otherwise fine in 99.9% of the other runs?


With great care and patience.

--
Ivan Reid, Electronic & Computer Engineering, ___ CMS Collaboration,
Brunel University. Ivan.Reid@[brunel.ac.uk|cern.ch] Room 40-1-B12, CERN
KotPT -- "for stupidity above and beyond the call of duty".

Kevin G. Rhoads

2006-01-23, 7:04 pm

>It is one program compiled as a whole, using Intel FC 8.0 for Linux.

Have you made sure it is compiling without warnings? Have you enabled all
warnings and checked them out? Has the code been compiled with any OTHER
compilers? Is it warning-free with them as well?

Intermittents are a b*tch; it is good to eliminate all possible sources of
glitchiness when they crop up. Finding and addressing all warnings is ONE
thing to try.

AN O'Nymous

2006-01-23, 7:04 pm

Thanks guys. I found the bug. It turned out that I forgot to update a
key array size as the size of the optimisation run increased. After a
while, the program started making calls beyond the allocated array
size. This was why the bug wasn't evident for short runs.

What I found surprising was that the program didn't crash immediately
after calling beyond the (insufficiently) defined array size. It
actually chugged on a bit and then crashed somewhere later.

Is this because of statistical chance that the stochastic program
didn't call beyond the array bounds, or are Fortran programs known to
"survive" for a bit despite erroneous array calls?

Richard E Maine

2006-01-23, 7:04 pm

AN O'Nymous <a_n_onymous80@yahoo.co.uk> wrote:

> Is this because of statistical chance that the stochastic program
> didn't call beyond the array bounds,


Possibly a contributing factor, but...

> or are Fortran programs known to
> "survive" for a bit despite erroneous array calls?


Yes, as are programs in most languages. (C is notoriously bad in this
area). That's why bounds checking options exist in most (or maybe even
all) Fortran compilers. The whole point of a bounds checking option is
to generate an error message as soon as it happens instead of silently
causing corruption that shows up sometime later in some nonobvious way.

Without bounds checking, an out-of-bounds reference typically just
refers to some other location in memory. If the reference is only
slightly out of bounds, that location will probably exist and reading or
writing to it will "work". The problems don't generally show up until
later, because you just changed the value of something else which
shouldn't have been changed. Sometimes the problems don't even show up at
all, for example if you happened to change the value of a variable that
doesn't matter because it isn't subsequently used.
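
A minimal sketch of what an explicit check at the point of the store
looks like when written by hand. The names store, buf and capacity are
hypothetical; a compiler's bounds checking option inserts an equivalent
test automatically and normally aborts rather than carrying on.

```fortran
program guarded_store
  implicit none
  integer, parameter :: capacity = 10
  real :: buf(capacity)
  integer :: i, nrejected
  logical :: ok

  buf = 0.0
  nrejected = 0

  ! Deliberately attempt two more stores than the array can hold.
  do i = 1, 12
     call store(buf, i, real(i), ok)
     if (.not. ok) nrejected = nrejected + 1
  end do
  print *, 'rejected stores:', nrejected

contains

  ! Store val in a(idx) only if idx is inside the declared bounds.
  subroutine store(a, idx, val, ok)
    real, intent(inout) :: a(:)
    integer, intent(in) :: idx
    real, intent(in) :: val
    logical, intent(out) :: ok

    ok = (idx >= lbound(a, 1) .and. idx <= ubound(a, 1))
    if (ok) then
       a(idx) = val
    else
       print *, 'index', idx, 'outside bounds', lbound(a, 1), ':', ubound(a, 1)
    end if
  end subroutine store

end program guarded_store
```

The point is that the failure is reported at the offending index rather
than at whatever unrelated statement later trips over the damage.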

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain| experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain

James Parsly

2006-01-23, 7:04 pm

It depends on what you mean by "survive", but when a Fortran program
accesses an array out of its bounds, it will simply read or write
whatever is in the memory location at the offset that it computes from
the beginning of the array. This is not something that is normally
caught in a release version of your code.

It's possible to get lucky and have that location fall somewhere where
nothing bad will happen. For example, the location might be part of
some other array that you were already done with. If you store and
retrieve data from such a location, nothing overtly bad will happen.

On the other hand, suppose that you weren't done with that other
location. Your program has just changed data that will be used
elsewhere in your program, and all sorts of bad things might happen.

Or instead of a data location, you might have changed some executable
code. On systems without memory protection, you could even be changing
code or data locations belonging to entirely different programs,
possibly even part of the operating system.

The good news is that compilers usually have a 'debug' option that you
can turn on that will check your array bounds. This makes the program
bigger and slower, which is why it's not normally turned on for a
release version.


Dr Ivan D. Reid

2006-01-23, 7:04 pm

On 23 Jan 2006 11:23:27 -0800, AN O'Nymous <a_n_onymous80@yahoo.co.uk>
wrote in <1138044207.912932.195760@o13g2000cwo.googlegroups.com>:
> Thanks guys. I found the bug. It turned out that I forgot to update a
> key array size as the size of the optimisation run increased. After a
> while, the program started making calls beyond the allocated array
> size. This was why the bug wasn't evident for short runs.


Pretty much as my first guess expected then.

> What I found surprising was that the program didn't crash immediately
> after calling beyond the (insufficiently) defined array size. It
> actually chugged on a bit and then crashed somewhere later.


> Is this because of statistical chance that the stochastic program
> didn't call beyond the array bounds, or are Fortran programs known to
> "survive" for a bit despite erroneous array calls?


I think Richard and James have adequately addressed this -- it's
a lottery as to what you might be changing in your data space. _Some_
architectures will throw up a segment violation immediately though, if
you try to modify an address that falls into a code segment; others don't
(necessarily) make the distinction.

--
Ivan Reid, Electronic & Computer Engineering, ___ CMS Collaboration,
Brunel University. Ivan.Reid@[brunel.ac.uk|cern.ch] Room 40-1-B12, CERN
KotPT -- "for stupidity above and beyond the call of duty".

glen herrmannsfeldt

2006-01-23, 9:56 pm

Richard E Maine <nospam@see.signature> wrote:
(snip)

> Yes, as are programs in most languages. (C is notoriously bad in this
> area). That's why bounds checking options exist in most (or maybe even
> all) Fortran compilers. The whole point of a bounds checking option is
> to generate an error message as soon as it happens instead of silently
> causing corruption that shows up sometime later in some nonobvious way.


The usual C implementation of dynamic memory stores the length
and other allocation information just before the user memory.
Storing to element -1 or -2 will usually destroy that information,
which will cause a crash in the next few calls to malloc()
or free().

Fortran might be a little less likely to use an implementation
like that, so it might last a little longer.

-- glen

glen herrmannsfeldt

2006-01-24, 3:57 am

Arjen Markus wrote:
> Another possibility: uninitialised variables. Their contents will be
> random (not pseudorandom) and are likely to vary with each run.


In years past, programs would be loaded into storage with whatever
was left from the previous user. For security reasons, multi-user
systems can't do that anymore. Memory is usually cleared to zero
before program loading and before pages are supplied to user programs.

Any non-zero memory is likely related to the user program itself.
The stack will tend to have what was previously left by your program,
not some other user's program.

There are still possibilities for stochastic behavior, but not nearly
as many as there used to be.

-- glen


Copyright 2009 codecomments.com