
Exciting new feature for g95
Andy Vaught

2004-10-20, 4:06 pm


I've added a new feature to g95 that I've wanted for a long time.
I live on the northern edge of the Sonoran desert, where seasonal
monsoons roll through. Sometimes they get rather close. In case
you've never heard it, thunder gets a special crackle to it when the
bolt is within a couple hundred meters. I usually shut my machine
down about then, losing any progress a running job has made.

The new feature is a mechanism for resuming jobs. Being a
reasonably lazy person, I reuse the existing unix core file for this,
which the operating system already writes. A good way of
getting a core file is to use the QUIT signal, which is usually bound
to the control-backslash key. The behaviour of QUIT is the same as
the interrupt signal (usually control-c or delete keys) except that it
dumps a core file if your ulimit allows it.
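
Whether a core file actually gets written depends on the core-size limit of the shell that starts the job. A minimal sketch for checking it beforehand, assuming a shell such as bash or dash where `ulimit -c` is available; the helper name `core_dumps_enabled` is made up for illustration:

```shell
# Hypothetical helper: succeeds if this shell allows core files to be written.
core_dumps_enabled() {
    # "ulimit -c" prints the soft core-size limit: "unlimited" or a number.
    # A limit of 0 means QUIT kills the job without leaving a core file.
    [ "$(ulimit -c)" != "0" ]
}

if core_dumps_enabled; then
    echo "core files enabled (limit: $(ulimit -c))"
else
    echo "core files disabled; run 'ulimit -c unlimited' before starting the job"
fi
```

If the job is not attached to your terminal, QUIT can also be sent from another shell with `kill -QUIT` followed by the process ID.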

It's now possible to point a g95-compiled program to a core file and
have it load the state of the running program and resume execution.
For example:

-----------------------------------
andy@fulcrum:~/g95/g95 % cat tst.f90

b = 0.0
do i=1, 10
   do j=1, 3000000
      call random_number(a)
      a = 2.0*a - 1.0
      b = b + sin(sin(sin(a)))
   enddo
   print *, i, b
enddo
end

andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
1 -464.5689
2 -38.27584
3 -652.6890
4 -597.2142
5 -150.8911
6 -376.1212
Quit (core dumped)
andy@fulcrum:~/g95/g95 % a.out --resume core
7 -1078.404
8 -1444.724
9 -372.3247
10 -934.3513
andy@fulcrum:~/g95/g95 %
-----------------------------------

Open files are reopened:


-----------------------------------
andy@fulcrum:~/g95/g95 % cat tst.f90

b = 0.0
do i=1, 10
   do j=1, 3000000
      call random_number(a)
      a = 2.0*a - 1.0
      b = b + sin(sin(sin(a)))
   enddo
   print *, i, b
   write(10,*) i, b
enddo
end
andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
1 -464.5689
2 -38.27584
3 -652.6890
4 -597.2142
5 -150.8911
Quit (core dumped)
andy@fulcrum:~/g95/g95 % cat fort.10
1 -464.5689
andy@fulcrum:~/g95/g95 % a.out --resume core
6 -376.1212
7 -1078.404
8 -1444.724
9 -372.3247
10 -934.3513
andy@fulcrum:~/g95/g95 % cat fort.10
1 -464.5689
2 -38.27584
3 -652.6890
4 -597.2142
5 -150.8911
6 -376.1212
7 -1078.404
8 -1444.724
9 -372.3247
10 -934.3513
andy@fulcrum:~/g95/g95 %
-----------------------------------

The fort.10 file isn't up to date when the core is dumped, but it is
still buffered inside of the core file. After resuming, the data is
correctly flushed to the disk.

This feature has a couple of limitations. You need to resume from the
same binary that you quit from. Open files need to be in the same
place and untouched. If you interface with another language, all bets
are off. This feature is only available on x86-based Linux systems at
the moment, with support for other systems to come in the future. A
further constraint is that resumption must be on a processor that
supports the same floating point registers as the one the core was
dumped on, i.e. SSE registers. Other than that it should just work.
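
Since resuming only works with the exact binary that dumped the core, it can help to record a checksum of the executable next to the core file. A rough sketch, assuming the standard `cksum` utility is available; the function names and the `.cksum` suffix are inventions for this example and not part of g95:

```shell
# Hypothetical helpers, not part of g95.

# Record the checksum of a binary next to its core file.
# usage: save_binary_sum a.out core
save_binary_sum() {
    cksum < "$1" > "$2.cksum"
}

# Succeed only if the binary still matches the recorded checksum.
# usage: same_binary a.out core && a.out --resume core
same_binary() {
    [ -f "$2.cksum" ] && [ "$(cksum < "$1")" = "$(cat "$2.cksum")" ]
}
```

This only catches a recompiled or replaced executable; it says nothing about whether the open data files are still untouched.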

For those who are wondering:

-------------------------------
andy@fulcrum:~/g95/g95 % cat tst.f90

integer, pointer :: p => NULL()
p = 1
end

andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
Segmentation fault (core dumped)
andy@fulcrum:~/g95/g95 % a.out --resume core
Segmentation fault (core dumped)
andy@fulcrum:~/g95/g95 %

-------------------------------

Which works perfectly. The saved state is right before the fault,
so when the program resumes it faults again.

We're pretty excited about this because it opens up a wide range of
possible uses. Without writing any special code, you can force a
short job through a long queue. Another possibility is moving a
running process to another machine. Take your work home. Move to a
faster machine. Free up a fast machine.

We're in the process of writing the "G95 Power User" page, so if you
can think of something that you can do with this, let us
know and we'll put it where everyone can see it.

I have one more large innovation planned for g95 that will make the
corefile resume look like the -r option to 'ls'. That is going to
have to wait for a while, though.

Thanks go to the testers, Michael Richmond, Doug Cox, Harald Anlauf,
Charles Rendleman and Joost Vandevondele. The two most interesting
comments were "I'm telling my sysadmin to start backing up core files"
and "People are really going to like this (after first distrusting it
because this can't work)".

Try it for yourself and let us know how it works: http://www.g95.org

Andy

------------------
Mail: domain=firstinter.net user=andyv


Jan Vorbrüggen

2004-10-25, 3:59 am

> "People are really going to like this (after first distrusting it
> because this can't work)".


I quite agree that this is a Good Thing and Great Work, but on a historical
note:

Plus ca change...

If you read the first issues of TeX: The Program, you'll note that this was
deemed to be a standard mechanism at the time. On the TOPS-10 machines then
in use at Stanford, you could interrupt your running program at any time with
CTRL-C and then type SAVE to the monitor. TeX used this to save its internal
state after initialization. Later on, as it was ported to other systems, it
needed to acquire an internal mechanism to dump its internal state.

Jan
