Code Comments
I've added a new feature to g95 that I've wanted for a long time. I live on the northern edge of the Sonoran desert, where seasonal monsoons roll through. Sometimes they get rather close. In case you've never heard it, thunder gets a special crackle to it when the bolt is within a couple hundred meters. I usually shut my machine down about then, losing any progress a running job has made.

The new feature is a mechanism for resuming jobs. Being a reasonably lazy person, the file I use for this is the existing unix core file, which the operating system already writes. A good way of getting a core file is to use the QUIT signal, which is usually bound to the control-backslash key. The behaviour of QUIT is the same as the interrupt signal (usually the control-c or delete keys) except that it dumps a core file if your ulimit allows it.

It's now possible to point a g95-compiled program to a core file and have it load the state of the running program and resume execution.
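For readers who want to check the two preconditions mentioned above (a non-zero core limit and a process that dies from QUIT), here is a minimal sketch in Python rather than Fortran. It is not part of g95; the helper names `core_dumps_enabled` and `quit_child` are made up for illustration, and `sleep 30` stands in for a long-running a.out.

```python
import resource
import signal
import subprocess


def core_dumps_enabled():
    """True if the soft core-size limit (what `ulimit -c` reports) is non-zero."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return soft != 0


def quit_child(argv):
    """Start argv, send it SIGQUIT, and return the number of the signal that ended it."""
    child = subprocess.Popen(
        argv,
        # Give QUIT its default action in the child, even if this process
        # was itself started with QUIT ignored.
        preexec_fn=lambda: signal.signal(signal.SIGQUIT, signal.SIG_DFL),
    )
    child.send_signal(signal.SIGQUIT)  # same signal control-backslash sends
    child.wait()
    # subprocess reports death-by-signal as a negative return code.
    return -child.returncode


if __name__ == "__main__":
    print("core files enabled:", core_dumps_enabled())
    # Hypothetical stand-in for a running job.
    print("child ended by signal", quit_child(["sleep", "30"]))
```

Whether a core file actually appears, and where, still depends on the system's core limit and core-file settings; the sketch only confirms that the limit is non-zero and that the child was ended by QUIT.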
For example, here is a job interrupted with QUIT and then resumed:

-----------------------------------
andy@fulcrum:~/g95/g95 % cat tst.f90
b = 0.0
do i=1, 10
   do j=1, 3000000
      call random_number(a)
      a = 2.0*a - 1.0
      b = b + sin(sin(sin(a)))
   enddo
   print *, i, b
enddo
end
andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
 1 -464.5689
 2 -38.27584
 3 -652.6890
 4 -597.2142
 5 -150.8911
 6 -376.1212
Quit (core dumped)
andy@fulcrum:~/g95/g95 % a.out --resume core
 7 -1078.404
 8 -1444.724
 9 -372.3247
 10 -934.3513
andy@fulcrum:~/g95/g95 %
-----------------------------------

Open files are reopened:

-----------------------------------
andy@fulcrum:~/g95/g95 % cat tst.f90
b = 0.0
do i=1, 10
   do j=1, 3000000
      call random_number(a)
      a = 2.0*a - 1.0
      b = b + sin(sin(sin(a)))
   enddo
   print *, i, b
   write(10,*) i, b
enddo
end
andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
 1 -464.5689
 2 -38.27584
 3 -652.6890
 4 -597.2142
 5 -150.8911
Quit (core dumped)
andy@fulcrum:~/g95/g95 % cat fort.10
 1 -464.5689
andy@fulcrum:~/g95/g95 % a.out --resume core
 6 -376.1212
 7 -1078.404
 8 -1444.724
 9 -372.3247
 10 -934.3513
andy@fulcrum:~/g95/g95 % cat fort.10
 1 -464.5689
 2 -38.27584
 3 -652.6890
 4 -597.2142
 5 -150.8911
 6 -376.1212
 7 -1078.404
 8 -1444.724
 9 -372.3247
 10 -934.3513
andy@fulcrum:~/g95/g95 %
-----------------------------------

The fort.10 file isn't up to date when the core is dumped, but the missing data is still buffered inside the core file. After resuming, the data is correctly flushed to the disk.

This feature has a couple of limitations. You need to resume from the same binary that you quit from. Open files need to be in the same place and untouched. If you interface with another language, all bets are off. The feature is only available on x86-based Linux systems at the moment, with support for other systems to come in the future. A further constraint is that resumption must be on a processor that supports the same floating point registers as the one the core was dumped on, i.e. SSE registers. Other than that it should just work.
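On the fort.10 point above: the reason the file on disk lags behind the program is ordinary output buffering. A minimal sketch of the same effect in Python (not g95's I/O library; the function name and the buffer size are arbitrary choices for illustration):

```python
import os
import tempfile


def sizes_before_and_after_flush(record):
    """Write record through a buffered file; return its on-disk size before and after a flush."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fort.10")
        f = open(path, "wb", buffering=65536)
        try:
            f.write(record)                 # lands in the in-memory buffer only
            before = os.path.getsize(path)  # nothing has reached the file yet
            f.flush()                       # hand the buffered bytes to the OS
            after = os.path.getsize(path)
        finally:
            f.close()
    return before, after


if __name__ == "__main__":
    print(sizes_before_and_after_flush(b"1 -464.5689\n"))
```

A core file captures the process memory, so a record that is still sitting in such a buffer when QUIT arrives is carried along in the core rather than lost, which is consistent with fort.10 being complete after the resumed run.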
For those who are wondering:

-------------------------------
andy@fulcrum:~/g95/g95 % cat tst.f90
integer, pointer :: p => NULL()
p = 1
end
andy@fulcrum:~/g95/g95 % g95 tst.f90
andy@fulcrum:~/g95/g95 % a.out
Segmentation fault (core dumped)
andy@fulcrum:~/g95/g95 % a.out --resume core
Segmentation fault (core dumped)
andy@fulcrum:~/g95/g95 %
-------------------------------

Which works perfectly. The saved state is right before the fault, so when the program resumes it faults again.

We're pretty excited about this because it opens up a wide range of possible uses. Without writing any special code, you can force a short job through a long queue. Another possibility is moving a running process to another machine. Take your work home. Move to a faster machine. Free up a fast machine. We're in the process of writing the "G95 Power User" page, so if you can think of something that you can do with this, let us know and we'll put it where everyone can see it.

I have one more large innovation planned for g95 that will make the corefile resume look like the -r option to 'ls'. That is going to have to wait for a while, though.

Thanks go to the testers, Michael Richmond, Doug Cox, Harald Anlauf, Charles Rendleman and Joost Vandevondele. The two most interesting comments were "I'm telling my sysadmin to start backing up core files" and "People are really going to like this (after first distrusting it because this can't work)".

Try it for yourself and let us know how it works: http://www.g95.org

Andy
------------------
Mail: domain=firstinter.net user=andyv

> "People are really going to like this (after first distrusting it
> because this can't work)".

I quite agree that this is a Good Thing and Great Work, but on a historical note: Plus ca change...

If you read the first issues of TeX: The Program, you'll note that this was deemed to be a standard mechanism at the time. On the TOPS-10 machines then in use at Stanford, you could interrupt your running program at any time with CTRL-C and then type SAVE to the monitor. TeX used this to save its internal state after initialization. Later on, as it was ported to other systems, it needed to acquire an internal mechanism to dump its internal state.

Jan