
Author writing to files
icymist

2005-06-07, 4:02 pm

Hello,

I have a general question about writing to files.

I run codes that generate a lot of data (about 2 MB) at every step of
the execution, and the number of steps runs into tens of thousands.

I want to know which is better -

1. Storing the data for a few steps and then writing the stored data to
a file.
or
2. Writing the data to a file each step.

Does the answer depend on the amount generated in each step?

One more thing I would like to know is what the advantages and/or
disadvantages are of writing unformatted binary files of the above data
compared to writing the formatted files. What I have observed is that
the formatted file after compression using gzip is smaller than the
compressed binary file.

Thanks for any help.

Regards,
Chaitanya.

Richard E Maine

2005-06-07, 4:02 pm

I just answered your almost identical post, but I see that you added a
few details to this copy.

In article <1118160176.872517.153480@g14g2000cwa.googlegroups.com>,
"icymist" <icymist@gmail.com> wrote:

> I run codes that generate a lot of data (about 2 MB) at every step of
> the execution, and the number of steps runs into tens of thousands.


The specific numbers here are new data.

> [write out at each step vs buffer and write data for multiple steps]
> Does the answer depend on the amount generated in each step?


Yes. And if you have 2 MB of data per step, then you won't see
measurable differences. 2 MB is enough to swamp any per-write overhead in
either time or space. I'd advise writing out each step separately unless
there are other considerations. If it were 2 bytes (or even 20) per
step, then things would be very different.
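
A minimal sketch of the write-every-step approach, in free-form Fortran. The array size, step count, file name `steps.dat`, unit number 10, and the `compute_step` routine are all assumptions invented for illustration; none of them come from the original post.

```fortran
program write_each_step
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 262144     ! assumed: 262144 * 8 bytes = 2 MiB per step
  integer, parameter :: nsteps = 5     ! kept small here; the thread mentions tens of thousands
  real(dp) :: a(n)
  integer :: step, ios

  open(unit=10, file='steps.dat', form='unformatted', access='sequential', &
       status='replace', action='write', iostat=ios)
  if (ios /= 0) stop 'cannot open steps.dat'

  do step = 1, nsteps
     call compute_step(step, a)        ! hypothetical stand-in for the real computation
     write(10) step, a                 ! one unformatted record per step, no manual buffering
  end do

  close(10)

contains

  ! Placeholder: fills the array with a value derived from the step number.
  subroutine compute_step(istep, x)
    integer, intent(in) :: istep
    real(dp), intent(out) :: x(:)
    x = real(istep, dp)
  end subroutine compute_step

end program write_each_step
```

Each `write` produces one record of roughly 2 MB, so the fixed cost of the I/O call is small next to the cost of moving the data itself.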

> One more thing I would like to know is what the advantages and/or
> disadvantages are of writing unformatted binary files of the above data
> compared to writing the formatted files. What I have observed is that
> the formatted file after compression using gzip is smaller than the
> compressed binary file.


I already mentioned some issues related to this, but the above size
numbers provide a little extra information... and I notice one odd piece
of wording that you use (in addition to the term "binary", but that
misuse is quite widespread, not directly relevant, and I already
commented on it). Namely...

You refer to the "compressed binary file". Why do you use the term
"compressed"? Is there actually any compression that you do on the file,
or are you just assuming that unformatted files are inherently
compressed? It sounds to me like you are assuming the latter, which is not
correct. Quite the opposite, an unformatted file is almost always a raw
copy of the bits from memory, with no compression or alteration in any
way (a few bytes typically get added at the beginning and end of each
write, but that is negligible compared to 2 MB per write). The reason
that formatted files are so much bigger is that the formatting increases
the size - not that the unformatted is compressed.

In particular, have you tried gzipping the unformatted file? Since
unformatted files are *NOT* compressed, they sometimes compress
reasonably with gzip, though typically not as much as formatted files do
(because numeric data in formatted files has so much redundancy from an
information-packing viewpoint).
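
One way to see the size difference directly is to write the same array both ways and ask the runtime for the file sizes. The sketch below is only an illustration: the file names, unit numbers, array length, and the `es24.16` edit descriptor are assumptions, and `inquire(..., size=...)` is a Fortran 2003 feature that compilers from the time of this thread may not support.

```fortran
program compare_sizes
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 100000          ! assumed array length
  real(dp) :: a(n)
  integer :: fsize, usize

  call random_number(a)                     ! arbitrary test data

  ! Formatted: one value per line, enough digits to round-trip a double.
  open(unit=10, file='data.txt', form='formatted', status='replace', action='write')
  write(10, '(es24.16)') a
  close(10)

  ! Unformatted: essentially the bits from memory plus small record markers.
  open(unit=11, file='data.bin', form='unformatted', status='replace', action='write')
  write(11) a
  close(11)

  ! size= reports file storage units (bytes on most systems), or -1 if unknown.
  inquire(file='data.txt', size=fsize)
  inquire(file='data.bin', size=usize)
  print *, 'formatted size:   ', fsize
  print *, 'unformatted size: ', usize
end program compare_sizes
```

With this format each value takes 24 characters plus a line terminator in the formatted file, against 8 bytes in the unformatted one. Running something like `gzip -c data.txt | wc -c` and `gzip -c data.bin | wc -c` afterwards shows the compressed sizes; how they compare depends heavily on the data and on how many digits are written.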

The above numbers prompt me to mention another issue. Files greater in
size than 2 GB can cause problems in many situations - more situations
than I could possibly list. The most recent example I ran into was that
some DVD-burning software (on multiple platforms) wasn't writing them
correctly. They might work fine for whatever particular use you have in
mind, but I advise keeping file sizes below 2 GB if practical; it will
just minimize possible headaches. You could even do something relatively
simple like close the file every GB or so and open a new one (probably
adding a sequence number into the file name). The overhead of closing
and opening a file will be negligible compared to the overall costs of
writing a GB.

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain | experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain
icymist

2005-06-07, 8:58 pm

> I just answered your almost identical post, but I see that you added a
> few details to this copy.

I tried finding the post at the site, but couldn't find it. That's the
reason for my repost. I wonder why I wasn't able to see the post?
Problem with my browser's cache?

> In particular, have you tried gzipping the unformatted file? Since
> unformatted files are *NOT* compressed, they sometimes compress
> reasonably with gzip, though typically not as much as formatted files do
> (because numeric data in formatted files has so much redundancy from an
> information-packing viewpoint).

When I said compressed binary file, I meant the unformatted file after
compressing with gzip. To state clearly what I did: I compressed both the
formatted file and the unformatted file with gzip and then
compared the sizes. I found that the compressed formatted file is
smaller than the compressed unformatted binary file. Hope it's clear
now. Sorry for the ambiguity.

> mind, but I advise keeping file sizes below 2 gb if practical; it will
> just minimize possible headaches. You could even do something relatively
> simple like close the file every gb or so and open a new one (probably
> adding a sequence number into the file name).

How can I close the file after every 1GB? Is it commonplace to close
files at specified sizes? I hope I am not asking a question to which I
am supposed to know the answer.


Regards,
Chaitanya.

Richard E Maine

2005-06-07, 8:58 pm

In article <1118170441.794202.114020@g14g2000cwa.googlegroups.com>,
"icymist" <icymist@gmail.com> wrote:
> I tried finding the post at the site, but couldn't find it. That's the
> reason for my repost. I wonder why I wasn't able to see the post?


Hard to say without more data, but my best guess would be timing. It
quite often takes a little while for a post to show up, even at the same
site you posted to. Usually only a few minutes, but it can be longer.
Other reasons are also possible. Anyway...

> When I said compressed binary file, I meant the unformatted file after
> compressing with gzip.


Ah. Ok. Then my guess as to the meaning was incorrect. Sounds like
you already tried what I suggested on that.

> How can I close the file after every 1GB? Is it commonplace to close
> files at specified sizes?


Here I was guilty of being a bit vague. There isn't any magic or
automated way to close based on size. I was thinking about very
simple-minded stuff... Just keep rough track of how much you have
written. Doesn't even have to be very precise for this purpose. When the
total gets big enough (for whatever definition of "big enough" seems
appropriate), close the file. Then reopen the same unit number with a
new file using a different file name.
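
The bookkeeping described above might look roughly like the following sketch. Everything specific in it is an assumption for illustration: the `out_NNNN.dat` naming scheme, unit number 10, the helper `open_next`, the array size, and the deliberately tiny 4 MiB limit (chosen so the rotation is visible in a short run; something near 1 GB would be the value suggested in the post).

```fortran
program rotate_output
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)    ! wide enough to count bytes past 2 GB
  integer, parameter :: n = 262144                    ! assumed: 2 MiB of real(dp) per step
  integer, parameter :: nsteps = 8
  integer(i8), parameter :: limit = 4_i8 * 1024_i8 * 1024_i8  ! demo limit, not a recommendation
  real(dp) :: a(n)
  integer(i8) :: written
  integer :: step, seq

  seq = 1
  written = 0
  call open_next(seq)

  do step = 1, nsteps
     a = real(step, dp)                 ! stand-in for the real computation
     if (written >= limit) then
        close(10)                       ! current file is "big enough"
        seq = seq + 1
        written = 0
        call open_next(seq)             ! same unit number, new file name
     end if
     write(10) step, a
     written = written + 8_i8 * n + 4   ! rough count; record markers are ignored
  end do

  close(10)

contains

  ! Opens unit 10 on a file whose name carries the sequence number.
  subroutine open_next(iseq)
    integer, intent(in) :: iseq
    character(len=32) :: fname
    write(fname, '(a,i4.4,a)') 'out_', iseq, '.dat'   ! valid for sequence numbers up to 9999
    open(unit=10, file=trim(fname), form='unformatted', status='replace', action='write')
  end subroutine open_next

end program rotate_output
```

The running total is kept in a 64-bit integer because a default integer can overflow near 2 GB, which is exactly the size range in question. The count does not need to be exact for this purpose.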

Commonplace? Well, not very. But neither is it unheard of. Depends on
the application. It is quite common in things like log files that would
otherwise grow indefinitely. They are typically closed and then a new
version opened based on various criteria. Sometimes the criterion is
just time (once per day or week or whatever). But sometimes you see
criteria like file size used also (close the file after a specified time
or file size, whichever comes first).

--
Richard Maine | Good judgment comes from experience;
email: my first.last at org.domain | experience comes from bad judgment.
org: nasa, domain: gov | -- Mark Twain
glen herrmannsfeldt

2005-06-07, 8:58 pm

Richard E Maine wrote:

> In article <nvOdncw76Le_QzjfRVn-tQ@comcast.com>,
> glen herrmannsfeldt <gah@ugcs.caltech.edu> wrote:


> That can help in some situations, but not all.


Yes, it was a little specific to that situation.

> Of course, it assumes that the compression program can
> handle a file >2gb (and even more basically,
> that the file system can handle it). Odds of
> that are at least reasonably good, but not certain.


Most unix compression programs, including at least compress
and gzip, are stream compressors that can be used in a pipeline.
They don't seek, and don't know at all how long the data stream is.

That might not be true for many Windows programs, though there
are versions of gzip that will run on windows.

> It also assumes that the programs writing and reading
> the file can write and read files >2gb.
> That is *FAR* from given. Of particular relevance
> to this newsgroup is that there exist Fortran compilers
> that won't handle such files, even if the underlying
> operating system is capable of it.


That was true in the case I described, even though
it wasn't Fortran. The OS could do it, but only for programs
with the largefiles attribute.

> Compressing after the fact can avoid problems with intermediate steps,
> such as transferring the data around. For example, it would have worked
> as a solution to the problem I had in burning a >2gb file to a DVD. But
> it still leaves you vulnerable to some potential problems. Whether the
> remaining problems are relevant or not depends on the particular
> application.


When working with large amounts of data it always takes extra work
to be sure that it is done right. Often (though not always) the costs
of recreating a large data set are greater than for a small data set.
I always like to do extra checks to make sure that it is copied
properly.

For Fortran UNFORMATTED files, which are harder to verify directly,
I have used programs that will print out some indication that the
data is right. If arrays are written to the file, I will read back
the data and print out the dimensions of the arrays. It might be
written as

WRITE(1) N,(A(I),I=1,N)

in a loop. I can then read it back and print out the values of N,
or maybe just the number of records written to the file.
That will catch some of the problems that could occur, such as
the file being truncated at some point.
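
A minimal sketch of such a check follows. It first writes a small test file in the layout shown above so that it is self-contained, then reads back only `N` from each record and counts the records. The file name `check.dat`, the array bound `nmax`, and the five test records are assumptions for illustration.

```fortran
program check_records
  implicit none
  integer, parameter :: nmax = 1000        ! assumed upper bound for the test arrays
  real :: a(nmax)
  integer :: n, i, ios, nrec

  ! Write a few records in the same layout as the WRITE statement above.
  open(unit=1, file='check.dat', form='unformatted', status='replace', action='write')
  do n = 1, 5
     a(1:n) = real(n)
     write(1) n, (a(i), i=1,n)
  end do
  close(1)

  ! Read back only the leading N of each record; the rest of the record is skipped.
  open(unit=1, file='check.dat', form='unformatted', status='old', action='read')
  nrec = 0
  do
     read(1, iostat=ios) n
     if (ios < 0) exit                     ! negative iostat: normal end of file
     if (ios > 0) then                     ! positive iostat: error, e.g. a damaged record
        print *, 'read error after record', nrec
        exit
     end if
     nrec = nrec + 1
     print *, 'record', nrec, ' N =', n
  end do
  close(1)

  print *, 'records read:', nrec
end program check_records
```

For the test file written here the program should report five records. Comparing the record count and the printed `N` values before and after a copy gives a quick, though not exhaustive, indication that the file arrived intact.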

-- glen



Copyright 2008 codecomments.com