| Author |
zip, read/write tests
|
|
| cr88192 2006-09-25, 6:55 pm |
| if anyone feels like commenting, that would be appreciated.
I am mostly just trying to get good performance (in terms of space
fragmentation, error recovery, ...) from the zip format, which is slightly
difficult given the design of the format.
well, yesterday at least I got around to modifying a previous library of
mine (zpack) to work with zipfiles instead of the custom format. technically
it works, but for some reason it seems when being used read/write, the
archive ends up being about 2x the size of the compressed payload.
now, it is probably not debugged/cleaned up enough for practical use yet
(dates, ... are not presently stored), and I had found quite a few bugs, and
in places the code is pretty hacky, so yeah...
I was also testing against windows and unzip, as I lack many other zip tools
right now.
about the most troublesome thing I found was that yes, I need to keep the
header at the end of the file, and I also need to keep the header directly
following the central directory.
at first, I did things about like before, putting the directory wherever it
would fit, and windows was fine with this, however, unzip complained about
this (but could still access the files).
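The layout that unzip objects to can be checked mechanically. Below is a minimal Python sketch (not taken from the lib discussed here) that finds the end-of-central-directory record and reports how many bytes sit between the end of the central directory and that record. The function name `cd_gap` is made up for illustration, and the search is naive: it assumes the signature does not also appear inside an archive comment and that the archive is not zip64.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
EOCD_FMT = "<4sHHHHIIH"   # fixed part of the end-of-central-directory record

def cd_gap(archive: bytes) -> int:
    """Bytes between the end of the central directory and the EOCD record."""
    eocd_pos = archive.rfind(EOCD_SIG)   # naive search, see assumptions above
    if eocd_pos < 0:
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack_from(EOCD_FMT, archive, eocd_pos)
    cd_size, cd_offset = fields[5], fields[6]
    return eocd_pos - (cd_offset + cd_size)

# Build a small conventional archive in memory and measure the gap.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"hello")
gap = cd_gap(buf.getvalue())
print(gap)
```

A non-zero result means something was placed between the directory and the trailing record, which is the situation unzip warns about.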
additionally, zip has little notion of holding space "in reserve" for the
directory, which has no concept of 'free entries' and tends to vary
in size. this caused problems for space management.
the current approach basically uses a hacky piece of code to try to reserve
space near the end of the file (but before the current true end if
possible), but this is ugly, and still has a problem:
if space is needed and doesn't exist within the archive, the directory will
jump forwards leaving a hole, which may or may not be filled.
one possibility is this:
during operation, the directory and header are pretended not to exist;
once the file is opened, the spans are cleared, and regarded as free space;
in this case, on commit, space is located near the end of the file (and if
the original space has not been touched, it is reused).
the problem is this:
if a crash occurs, since neither the CD nor the header is intact, the
archive is effectively destroyed (apart from recreation from in-archive
headers).
actually, I could do this:
the CD is left intact, or an auxiliary CD is kept hidden in the contents
(possibly compressed), along with a backup header at the start of the file
(that or this is the true header/directory, as far as the lib is concerned,
and the traditional CD/header is used/stored mostly for compatibility).
in this way, I can go back to how I was before, eg, committing after every write,
without problem. the issue would then be primarily that a crash would break
complete zip compatibility...
another possibility is this:
I store the file like I have (no backup header or auxiliary CD), and if the
file is damaged, the CD is reconstructed using a brute-force scan of the
archive (possible since all the stored files have headers anyways).
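A brute-force scan of this kind might look roughly like the following Python sketch. It is only an illustration under some assumptions: the whole archive fits in memory, entries carry their sizes in the local header (streamed entries that use a data descriptor, flag bit 3, are skipped), and names are decoded as cp437. The function name `scan_local_headers` is hypothetical.

```python
import io
import struct
import zipfile

LFH_SIG = b"PK\x03\x04"
LFH_FMT = "<4sHHHHHIIIHH"            # fixed 30-byte part of a local file header
LFH_SIZE = struct.calcsize(LFH_FMT)

def scan_local_headers(archive: bytes):
    """Yield (offset, name, compressed_size) for each plausible local header."""
    pos = archive.find(LFH_SIG)
    while pos >= 0 and pos + LFH_SIZE <= len(archive):
        (_sig, _ver, flags, _method, _time, _date, _crc,
         csize, _usize, name_len, extra_len) = struct.unpack_from(LFH_FMT, archive, pos)
        data_start = pos + LFH_SIZE + name_len + extra_len
        if data_start + csize <= len(archive) and not flags & 0x08:
            name = archive[pos + LFH_SIZE:pos + LFH_SIZE + name_len].decode("cp437")
            yield pos, name, csize
            pos = archive.find(LFH_SIG, data_start + csize)   # jump past the payload
        else:
            pos = archive.find(LFH_SIG, pos + 1)              # false hit; keep looking

# Simulate a damaged archive by cutting off the central directory and EOCD.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"alpha")
    zf.writestr("b.txt", b"bravo!")
raw = buf.getvalue()
damaged = raw[:raw.find(b"PK\x01\x02")]
recovered = list(scan_local_headers(damaged))
print(recovered)
```

The recovered tuples contain enough to rebuild central directory entries; deciding which of several copies of a name is the live one is the invalidation problem and is not handled here.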
I may need an invalidation method (eg: modifying the markers on no-longer
valid files), maybe to try to prevent attempts to recover old or partial
versions of files...
possible: for faster scanning in this case, I could force everything to be
aligned on fixed boundaries (eg: 16 or 64 bytes).
actually, doing this (directly) would cause difficulties for externally
created archives, so better (if used at all) would be a better
auto-alignment hack (done within the caching code itself), or simply a
modified version of the free-space search function (searches for a big
enough space aligned to a fixed boundary).
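As a rough illustration of the modified free-space search, here is a small Python sketch of a best-fit query that only accepts start offsets on a fixed boundary. The span representation (a list of `(offset, length)` tuples) and the function names are assumptions made for the example, not the lib's actual structures.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment

def best_fit_aligned(free_spans, size, alignment=16):
    """Return the aligned start of the tightest free span that can hold
    `size` bytes, or None if nothing fits.

    free_spans is an iterable of (offset, length) tuples.
    """
    best = None
    for offset, length in free_spans:
        start = align_up(offset, alignment)
        usable = offset + length - start      # space left after alignment padding
        if usable >= size and (best is None or usable < best[1]):
            best = (start, usable)
    return None if best is None else best[0]

spans = [(10, 100), (200, 40), (300, 20)]
print(best_fit_aligned(spans, 30))
```

The padding lost to alignment counts against each span, so a span that would fit unaligned may be rejected.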
then again, the scanning function would miss files in damaged archives which
contain unaligned content.
could still do it, and maybe store a special marker (an 'everything is
aligned ok' marker). except, I will not bother aligning the CD, no need to
have it aligned in this case...
I can put this marker at the start of the file. why?
partly it is because of an issue in my space management code, I always need
something right at the start of the archive (otherwise, I could modify my
code to deal with the case where nothing is there...). in the zpack format,
it didn't matter, since this is where I put the header...
so, at the start of the file will probably be a general informational header
(telling about, eg, if things are aligned ok, and what state the archive is
in, ...).
actually, I may need another spec mostly just describing how my lib
interprets/manages the zipfile contents (and any custom tags used).
or such...
any thoughts or comments?...
| |
|
|
"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in message
dc1c7$45184cc2$ca83a8d6$5873@saipan.com...
> if anyone feels like commenting that would be .
>
> I am mostly just trying to get good performance (in terms of space
> fragmentation, error recovery, ...) from the zip format, which is slightly
> difficult given the design of the format.
>
>
> well, yesterday at least I got around to modifying a previous library of
> mine (zpack) to work with zipfiles instead of the custom format. technically
> it works, but for some reason it seems when being used read/write, the
> archive ends up being about 2x the size of the compressed payload.
When you add a new file, what do you do? You should first remove the CD and
add the new file at the offset where the CD began, right? Then what do you
do? Do you write the CD back? If you do, and you then add more files, that
is unnecessary I/O. You should keep the CD in memory instead and only write
it back after all the additions are made.
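For comparison, Python's standard `zipfile` module in append mode behaves, as far as I can tell, in just this way: new entries start where the old central directory was, and the directory is written once when the archive is closed. The sketch below only demonstrates the observable effect on an in-memory archive; it says nothing about how either poster's library works.

```python
import io
import zipfile

# Create an archive with one entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("first.txt", b"one")
size_before = len(buf.getvalue())

# Append two entries; the directory is rewritten once, on close.
with zipfile.ZipFile(buf, "a") as zf:
    zf.writestr("second.txt", b"two")
    zf.writestr("third.txt", b"three")

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    second_offset = zf.getinfo("second.txt").header_offset

# The first appended entry begins inside the region the old CD/EOCD occupied.
print(names, second_offset < size_before)
```

The cost of this approach is the one raised elsewhere in the thread: between the first append and the close, the file on disk has no usable central directory.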
>
> now, it is probably not debugged/cleaned up enough for practical use yet
> (dates, ... are not presently stored), and I had found quite a few bugs, and
> in places the code is pretty hacky, so yeah...
>
>
> I was also testing against windows and unzip, as I lack many other zip tools
> right now.
>
> about the most troublesome thing I found was that yes, I need to keep the
> header at the end of the file, and I also need to keep the header directly
> following the central directory.
Not clear to me. What do you mean by "keep the header at the end of the
file" and by "keep the header directly following the CD"?
Do you maintain the following?
Overall zipfile format:
[local file header + file data + data_descriptor] . . .
[central directory] end of central directory record
>
> at first, I did things about like before, putting the directory wherever it
> would fit, and windows was fine with this, however, unzip complained about
> this (but could still access the files).
What do you mean by "putting the directory wherever it would fit"?
Do you mean putting the CD wherever it would fit, so that you can have
entries after the CD? Is that what you mean? According to the zip spec, you
should have the CD at the end, followed by the end of CD record at the very
end.
>
> additionally, zip lacks much notion of holding space "in reserve" for the
> directory, which lacks such notions as 'free entries', and will tend to vary
> in size. this made problems for space management.
>
> the current approach basically uses a hacky piece of code to try to reserve
> space near the end of the file (but before the current true end if
> possible), but this is ugly, and still has a problem:
> if space is needed and doesn't exist within the archive, the directory will
> jump forwards leaving a hole, which may or may not be filled.
>
What do you need space for? To add a new file, or to update an entry which
would not fit at its current offset?
In both cases, you need to remove the CD from disk and write the new data at
the offset where the CD started.
> one possibility is this:
> during operation, the directory and header are pretended not to exist;
> once the file is opened, the spans are cleared, and regarded as free space;
> in this case, on commit, space is located near the end of the file (and if
> the original space has not been touched, it is reused).
>
> the problem is this:
> if a crash occurs, since neither the CD nor the header is intact, the
> archive is effectively destroyed (apart from recreation from in-archive
> headers).
>
Yes. It can happen. You need to recover by scanning the local headers. But
you should code it in a way to make it less likely to happen.
After your code matures, there should not be any crash at all, or you should
deal with it accordingly.
>
> actually, I could do this:
> the CD is left intact, or an auxiliary CD is kept hidden in the contents
> (possibly compressed), along with a backup header at the start of the file
> (that or this is the true header/directory, as far as the lib is concerned,
> and the traditional CD/header is used/stored mostly for compatibility).
>
> in this way, I can go to how I was before, eg, committing after every write,
> without problem. the issue would then be primarily that a crash would break
> complete zip compatibility...
>
>
> another possibility is this:
> I store the file like I have (no backup header or auxiliary CD), and if the
> file is damaged, the CD is reconstructed using a brute-force scan of the
> archive (possible since all the stored files have headers anyways).
>
> I may need an invalidation method (eg: modifying the markers on no-longer
> valid files), maybe to try to prevent attempts to recover old or partial
> versions of files...
>
> possible: for faster scanning in this case, I could force everything to be
> aligned on fixed boundaries (eg: 16 or 64 bytes).
>
> actually, doing this (directly) would cause difficulties for externally
> created archives, so better (if used at all) would be a better
> auto-alignment hack (done within the caching code itself), or simply a
> modified version of the free-space search function (searches for a big
> enough space aligned to a fixed boundary).
>
> then again, the scanning function would miss files in damaged archives which
> contain unaligned content.
> could still do it, and maybe store a special marker (an 'everything is
> aligned ok' marker). except, I will ignore the aligning the CD, no need to
> have it aligned in this case...
>
> I can put it at the start of the file, why?
> partly it is because of an issue in my space management code, I always need
> something right at the start of the archive (otherwise, I could modify my
> code to deal with the case where nothing is there...). in the zpack format,
> it didn't matter, since this is where I put the header...
It may be OK for your utility but is it OK for zip spec?
>
> so, at the start of the file will probably be a general informational header
> (telling about, eg, if things are aligned ok, and what state the archive is
> in, ...).
>
>
> actually, I may need another spec mostly just describing how my lib
> interprets/manages the zipfile contents (and any custom tags used).
>
> or such...
>
>
> any thoughts or comments?...
>
>
| |
| cr88192 2006-09-26, 3:55 am |
|
"Aslan" <aslanski2002@yahoo.com> wrote in message
news:efafrl$j8f$1@emma.aioe.org...
>
> "cr88192" <cr88192@NOSPAM.hotmail.com> wrote in message
> dc1c7$45184cc2$ca83a8d6$5873@saipan.com...
>
> When you add a new file, what do you do? You should delete CD from first and
> add the new file at the offset the CD began, right? Then what do you do? Do
> you write the CD back. If you do it and if you will add more files again
> that would be an unnecessary IO. You should keep CD in memory instead and
> only put it back after all the additions are made.
>
it doesn't operate like this, exactly.
when it needs to store something ("committing changes"), it will compress
into a buffer to figure out how big it is, and (if it already has a spot)
check whether the new data will fit there; otherwise it removes the old span,
and does a query to find the best-fit.
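The commit logic described here could be sketched in Python roughly as follows. This is not the lib's code: the span tuples, the `plan_commit` name and the three action labels are invented for illustration, the freed old span is not merged with neighbouring free space, and raw deflate is produced with zlib's `wbits=-15`.

```python
import zlib

def deflate_raw(data: bytes) -> bytes:
    """Compress to a raw deflate stream, as stored inside zip entries."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()

def plan_commit(data, current_span, free_spans, end_of_data):
    """Decide where a payload should go.

    current_span: (offset, length) of the entry's existing spot, or None.
    free_spans:   list of (offset, length) holes inside the archive.
    Returns (action, offset, payload).
    """
    payload = deflate_raw(data)          # compress first to learn the real size
    need = len(payload)
    if current_span is not None and need <= current_span[1]:
        return "reuse", current_span[0], payload
    fits = [span for span in free_spans if span[1] >= need]
    if fits:
        offset, _length = min(fits, key=lambda span: span[1])   # best fit
        return "move", offset, payload
    return "append", end_of_data, payload

action, offset, payload = plan_commit(b"a" * 1000, (100, 64), [(500, 1000)], 4000)
print(action, offset, len(payload))
```

Appending only when no hole fits matches the stated goal of limiting growth at the end of the file.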
now, consider you have an archive, and you add a bunch of files. so far, so
good, most go as expected. then consider you add more files and re-add some
old ones, now then, it ends up allocating new spots (potentially within the
file, or potentially at the end), effectively fragmenting the space.
combined with the current CD management (the fact it has to remain at the
EOF), this leads to additional fragmentation, which I have found leads to
about a 2x inflation. I have found this to be more or less a constant (the
file does not increase over time, eg, absent increasing the amount of
compressed data).
theoretically, the fragmentation should be lower, eg, a 4/3 inflation, or
maybe a 50% (3/2) inflation, but not 2x...
I suspect part of this may be that the space-management code is naive, and
keeps trying to put stuff after the CD, and the CD keeps jumping back to the
EOF, effectively leaving holes that may be filled later (resulting in an
additional amount of inflation, that stabilizes once no more data is
attempted to be put after the CD).
as such, things only go on the end if no better place exists. this is to try
to limit "wandering inflation", eg, as a natural byproduct of modifying
data, and committing changes to disk, the archive keeps getting bigger...
>
> Not clear to me. What do you mean by "keep the header at the end of the
> file" and by "keep the header directly following the CD"?
> Do you maintain the following?
>
> Overall zipfile format:
>
> [local file header + file data + data_descriptor] . . .
> [central directory] end of central directory record
>
well, I was pushing this, trying to see if I could do things like:
[files preceding CD]
[central directory]
[files following CD]
[end of central directory record]
why? because this leads to less fragmentation, and avoids me having to use a
funky set of hacks to always keep the CD at the end of file.
in the last zpack format, the directory was put "wherever there was room",
likewise, the directory would keep itself padded as well (eg: when possible
about 3/2 the size of the payload).
>
>
>
> What do you mean by "putting the directory wherever it would fit"?
> Do you mean putting the CD wherever it would fit. So you can have entries
> after the CD? Is that what you mean? But according to zip spec, you should
> have CD at the end followed by end of CD record at the very end.
>
this is what the spec says, but I was seeing if any tools would care. unzip
did, so I figure maybe it mattered. windows doesn't, however.
this restriction, however, is annoying as it makes space management more
difficult, and limits how often I can meaningfully commit changes to disk.
> What do you need space for? To add a new file or to update an entry which
> would not fit in the current offset?
yes.
> Both cases, you need remove CD from disk and add it at the offset the CD
> starts.
>
but, note that most of the code does not know of or care where the CD is, as
(since most of this code was originally written for a rather different
format) it tends to try to view the CD as "yet another file", but this
doesn't work so well with zip.
all the CD management stuff is largely constrained to the "commit" function,
which has to look at the damage done by the other code.
> Yes. It can happen. You need to recover by scanning the local headers. But
> you should code it in a way to make it less likely to happen.
> After your code matures, there should not be any crash at all, or you
> should deal with it accordingly.
>
note, I can eliminate crashes within the lib, but not within the host app.
additionally, there is no guarantee that the host app will exit cleanly
either (or properly unmount everything).
so, complete stability can't be assumed.
recovery should be as quick and painless as possible.
for larger archives, a linear scan (even with a large alignment) may be
impractical, so some other means (eg: a differently managed directory
structure) may still be needed. likewise, I have to pre-process the
directory to make it usable anyways, so the zip format may not be all that
scalable either (internally, it pre-processes the directories into a form
similar to that used in the zpack format).
>
> It may be OK for your utility but is it OK for zip spec?
>
well, the zip spec doesn't much go into how to deal with unknown "garbage".
theoretically, it shouldn't matter.
actually, I am more recently partly considering breaking altogether from a
strict zip adherence, and instead making a format technically closer to a
"faux zip".
so, plain zip files may be usable, but read-only, and my files can be
accessed with zip tools, but internally the lib/format will do its magic
with a different set of structures.
so, what may the format be like:
header at the start of the file, and a directory located anywhere in the
archive (the ZIP CD may be ignored);
I may split directories (eg: a representation more like that of FAT in some
respects, vs a linking tree structure);
....
then again, a segmented directory scheme, to be used efficiently, would
demand a different means of space management (a statically stored tree, eg,
probably a B-Tree), as the current approach involves making a pass over the
directory tree to rebuild the spans tables (at load time).
that or I could be like "zip is good enough", and live with the poor
scalability (it is fine for "light duty" use, eg, maybe a few hundred or a
few K files, but I doubt much more than this, eg, like a real filesystem).
or I could declare that it is pointless to emulate zip in this case (I have
about lost the whole purpose of zip support if it is simply a hindered
emulation), and just stick with either the older, or a revised version of
the ZPACK format.
so, my options are this:
zip, and assume only light duty uses (a few k files or less);
another format, but I damn well better make it scalable, otherwise there is
no real point...
while I was at it, for such a "heavy duty" format, may as well include
transaction logs as well (vs, the half-assed pseudo transactional approaches
I use now...).
or such...
| |
|
|
"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in message
31538$4518d43d$ca83a8d6$16340@saipan.com...
<snip>
>
> it doesn't operate like this, exactly.
> when it needs to store something ("committing changes"), it will compress
> into a buffer to figure out how big it is, and (if it already has a spot)
> checking if the new data will fit there, otherwise it removes the old span,
> and does a query to find the best-fit.
What if the buffer is not big enough to hold the compressed data? Most of the
time it should work, but say you are adding a very big file that you cannot
compress into a buffer because you don't have enough memory for it.
>
> now, consider you have an archive, and you add a bunch of files. so far, so
> well, most go as expected. then consider you add more files and re-add some
> old ones, now then, it ends up allocating new spots (potentially within the
> file, or potentially at the end), effectively fragmenting the space.
> combined with the current CD management (the fact it has to remain at the
> EOF), this leads to additional fragmentation, which I have found leads to
> about a 2x inflation. I have found this to be more or less a constant (the
> file does not increase over time, eg, absent increasing the amount of
> compressed data).
>
> theoretically, the fragmentation should be lower, eg, a 4/3 inflation, or
> maybe a 50% (3/2) inflation, but not 2x...
>
> I suspect part of this may be that the space-management code is naive, and
> keeps trying to put stuff after the CD, and the CD keeps jumping back to the
> EOF, effectively leaving holes that may be filled later (resulting in an
> additional amount of inflation, that stabilizes once no more data is
> attempted to be put after the CD).
>
>
> as such, things only go on the end if no better place exists. this is to try
> to limit "wandering inflation", eg, as a natural byproduct of modifying
> data, and committing changes to disk, the archive keeps getting bigger...
>
>
>
> well, I was pushing this, trying to see if I could do things like:
>
>
> [files preceding CD]
> [central directory]
> [files following CD]
> [end of central directory record]
>
> why? because this leads to less fragmentation, and avoids me having to use a
> funky set of hacks to always keep the CD at the end of file.
>
I have never tried this. I maintain a flag to see whether the CD is on the
disk or not. Once it is removed (by adding a new file for example), it is
kept in the memory until the program terminates then it is written back to
the disk.
<snip>
> note, I can eliminate crashes, within the lib, but not within the host app.
> additionally, there is no guarantee that the host app will exit cleanly
> either (or properly unmount everything).
>
> so, complete stability can't be assumed.
> recovery should be as quick and painless as possible.
Right.
<snip>
>
> well, the zip spec doesn't much go into how to deal with unknown "garbage".
>
> theoretically, it shouldn't matter.
>
>
> actually, I am more recently partly considering breaking altogether from a
> strict zip adherence, and instead making a format technically closer to a
> "faux zip".
>
> so, plain zip files may be usable, but read-only, and my files can be
> accessed with zip tools, but internally the lib/format will do it's magic
> with a different set of structures.
>
>
> so, what may the format be like:
> header at the start of the file, and a directory located anywhere in the
> archive (the ZIP CD may be ignored);
> I may split directories (eg: a representation more like that of FAT in some
> respects, vs a linking tree structure);
> ...
>
> then again, a segmented directory scheme, to be used efficiently, would
> demand a different means of space management (a statically stored tree, eg,
> probably a B-Tree), as the current approach involves making a pass over the
> directory tree to rebuild the spans tables (at load time).
>
>
> that or I could be like "zip is good enough", and live with the poor
> scalability (it is fine for "light duty" use, eg, maybe a few hundred or a
> few K files, but I doubt much more than this, eg, like a real filesystem).
>
> or I could declare that it is pointless to emulate zip in this case (I have
> about lost the whole purpose of zip support if it is simply a hindered
> emulation), and just stick with either the older, or a revised version of
> the ZPACK format.
>
>
> so, my options are this:
> zip, and assume only light duty uses (a few k files or less);
> another format, but I damn well better make it scalable, otherwise there is
> no real point...
>
> while I was at it, for such a "heavy duty" format, may as well include
> transaction logs as well (vs, the half-assed pseudo transactional approaches
> I use now...).
>
> or such...
Why do you need the zip format? Maybe you should maintain your own format
and, if necessary, convert it into a zip file.
You can have utilities to read zip files, and if necessary you can output a
zip file by reading your files. For example, it is very straightforward to
turn a gzip file into a zip file: just arrange the header, raw-copy the
already deflated file data after it, then update the CD.
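A minimal Python sketch of that gzip-to-zip conversion is shown here, under a few assumptions: the gzip file has a single member, the entry name is plain ASCII, the timestamp is a placeholder (1980-01-01), and sizes fit in 32 bits. The function name `gzip_to_zip` is made up. The deflate stream, CRC-32 and uncompressed size are taken straight from the gzip file without recompressing.

```python
import gzip
import io
import struct
import zipfile

def gzip_to_zip(gz: bytes, name: str) -> bytes:
    """Wrap the deflate stream of a single-member gzip file in a one-entry zip."""
    if gz[:3] != b"\x1f\x8b\x08":
        raise ValueError("not a deflate-compressed gzip file")
    flags = gz[3]
    pos = 10                                    # fixed gzip header
    if flags & 0x04:                            # FEXTRA: 2-byte length + data
        pos += 2 + struct.unpack_from("<H", gz, pos)[0]
    if flags & 0x08:                            # FNAME: NUL-terminated
        pos = gz.index(b"\x00", pos) + 1
    if flags & 0x10:                            # FCOMMENT: NUL-terminated
        pos = gz.index(b"\x00", pos) + 1
    if flags & 0x02:                            # FHCRC: 2-byte header CRC
        pos += 2
    deflated = gz[pos:-8]                       # raw deflate data, copied as-is
    crc, usize = struct.unpack("<II", gz[-8:])  # gzip trailer: CRC-32, ISIZE
    fname = name.encode("ascii")
    dos_time, dos_date = 0, (1 << 5) | 1        # placeholder: 1980-01-01 00:00
    local = struct.pack("<4sHHHHHIIIHH", b"PK\x03\x04", 20, 0, 8,
                        dos_time, dos_date, crc, len(deflated), usize,
                        len(fname), 0) + fname
    central = struct.pack("<4sHHHHHHIIIHHHHHII", b"PK\x01\x02", 20, 20, 0, 8,
                          dos_time, dos_date, crc, len(deflated), usize,
                          len(fname), 0, 0, 0, 0, 0, 0) + fname
    cd_offset = len(local) + len(deflated)
    eocd = struct.pack("<4sHHHHIIH", b"PK\x05\x06", 0, 0, 1, 1,
                       len(central), cd_offset, 0)
    return local + deflated + central + eocd

original = b"hello zip " * 100
converted = gzip_to_zip(gzip.compress(original), "hello.txt")
with zipfile.ZipFile(io.BytesIO(converted)) as zf:
    restored = zf.read("hello.txt")
print(restored == original)
```

This works because gzip and zip both use deflate and the same CRC-32, so only the framing changes.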
>
>
>
| |
| cr88192 2006-09-26, 7:55 am |
|
"Aslan" <aslanski2002@yahoo.com> wrote in message
news:efaodo$hj5$1@emma.aioe.org...
>
> "cr88192" <cr88192@NOSPAM.hotmail.com> wrote in message
> 31538$4518d43d$ca83a8d6$16340@saipan.com...
>
> <snip>
>
>
> What if the buffer is not big enough to compress it? Most of the times it
> should work, but say you are adding a very big file which you cannot
> compress it to a buffer because you don't have enough memory to do it.
>
well, for the most part this lib is not specialized in compressing large
files, but rather a large number of small files.
as for buffer size, I assume that there is a max amount data can expand when
compressed, and allocate buffers accordingly.
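One way to size such a buffer is a worst-case bound on deflate output. The Python sketch below uses the formula that, to the best of my recollection, zlib's C function `compressBound()` applies for its default settings; treat the exact constants as an assumption and call `compressBound()` itself where an authoritative value matters. The name `deflate_bound` is hypothetical.

```python
import zlib

def deflate_bound(source_len: int) -> int:
    """Upper bound on zlib-wrapped deflate output for source_len input bytes.

    Assumed to mirror zlib's compressBound() for default parameters.
    """
    return (source_len + (source_len >> 12) + (source_len >> 14)
            + (source_len >> 25) + 13)

samples = [b"", b"\x00" * 5000, bytes(range(256)) * 4]
for sample in samples:
    compressed = zlib.compress(sample)
    print(len(sample), len(compressed), deflate_bound(len(sample)))
```

The bound grows only slightly faster than the input, so a buffer of this size costs little extra memory for small files.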
actually, the previous version supported fragmenting the file, but this
version doesn't (main reason being that zip can't represent fragmented files
anyways...).
<snip>
> I have never tried this. I maintain a flag to see whether the CD is on the
> disk or not. Once it is removed (by adding a new file for example), it is
> kept in the memory until the program terminates then it is written back to
> the disk.
>
I assume then your lib works rather differently than mine...
in my case, this is not as simple.
actually, I considered something similar, but the problem is that, since the
only time the CD can be meaningfully written is on closing the archive, much
of the time it will be absent leaving a lot of room for possible problems...
> <snip>
>
>
> Right.
>
<snip>
>
> Why do you need zip format? Maybe you should maintain your format and if
> necessary you can turn your format into a zip format.
well, I don't really, apart from a mild impulse to be "standard".
otherwise, I currently have other code that accesses zip files (and is
read-only, but is based on my previous understanding of the format, and thus
would have problems with archives actually using the things I have
discovered are possible...).
the problem with zip is its design itself, which thinking more, just isn't
so great for non-trivial uses.
> You can have utilities to read zip files and if necessary you can output a
> zip file by reading your files. For example it is very straightforward to
> turn a gzip file into a zip file. Just arrange the header and raw copy
> already deflated file data after it than update the CD.
>
maybe.
I may modify my format though, as zip has a few possible features I could
make use of (in particular, that info is attached to the file payloads as
well).
actually, I had considered it before, but instead opted to keep the data in
the directory.
data with file:
potentially worse for performance (have to seek and read info);
potentially better for reliability (find header, and there is a chance of
recovery);
more natural handling of variable-length items (names, optional data, ...).
data in directory:
may be more efficient (info is already there);
less flexibility (in my case, having to fit data within fixed-size items);
....
a hybrid is possible as well, eg, some data may be stored within the dir
entry, and other data with the file. eg, if the dirent only stores the name,
the offset, and a few other pieces of info, but anything else goes with the
file, then the task is simplified. fragmentation also potentially becomes a
little cleaner (each fragment having its own header).
however, the format itself is less blatantly simple, but this may be ok
(this version may more emphasize "scalability" and "error-recovery" than
"simplicity"). likewise, since "forks" ended up being added anyways, I may
as well add a less crufty mechanism for dealing with them (since entries
were linked together, I would have each entry followed by other entries
which could potentially define forks).
....
then again, it could be that I am wasting time better spent on other
things...
or such...
| |
| Carsten Neubauer 2006-09-26, 7:55 am |
| cr88192 schrieb:
> ...
>
> then again, it could be that I am wasting time better spent on other
> things...
>
Might be true. The design of zipfiles makes handling complicated.
The best workaround I found to deal with a zipfile is to split it into
two temporary files (one for the file-data and one for the central
directory), do adding, replacing or deleting, combine the two files
again and when the new zipfile is complete, I replace the original
archive with the new file.
This is quite ugly in terms of efficiency and read/write accesses,
but minimizes the possibility to have some crash leaving a
damaged zipfile.
Just my point of view,
Carsten Neubauer
http://www.c14sw.de/
| |
| Darius Blaszijk 2006-09-26, 6:55 pm |
| >> What do you mean by "putting the directory wherever it would fit"?
>
> this is what the spec says, but I was seeing if any tools would care.
> unzip did, so I figure maybe it mattered. windows doesn't, however.
What unzip version did you use? I added two lfh and filedata between CD and
end of CD header and it worked. Sure, unzip complained about having too many
bytes between the two headers, but it unzipped just fine. Windows didn't even
notice it, like you said.
I tried using unzip 5.42.
I like your ideas, but if I may ask you what does adding data between the CD
and end of CD header bring you? You still need to add the end of CD header
after all data is written. So there is still a potential zipfile corruption
during a crash.
I'm thinking of streaming all the data into a temporary file, adding to it
there, and then renaming that file over the original. This way if something
crashes you still have the original file.
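The temporary-file approach might be sketched in Python like this, using the standard `zipfile` module and `os.replace`. It is only an illustration: the function name `rewrite_atomically` is invented, entries are decompressed and recompressed rather than raw-copied, and the whole archive is rewritten on every call, which is the I/O cost discussed in this thread.

```python
import os
import tempfile
import zipfile

def rewrite_atomically(path, updates):
    """Copy the archive at `path` to a temp file, applying `updates`
    (a dict of name -> new bytes), then swap the new file into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".zip.tmp", dir=directory)
    os.close(fd)
    try:
        with zipfile.ZipFile(path) as src, \
             zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                if info.filename not in updates:      # unchanged entries
                    dst.writestr(info, src.read(info))
            for name, data in updates.items():        # replaced or new entries
                dst.writestr(name, data)
        os.replace(tmp_path, path)                    # swap in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)                       # leave the original untouched
        raise

with tempfile.TemporaryDirectory() as workdir:
    archive_path = os.path.join(workdir, "t.zip")
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.writestr("a.txt", b"old a")
        zf.writestr("b.txt", b"old b")
    rewrite_atomically(archive_path, {"b.txt": b"new b", "c.txt": b"added"})
    with zipfile.ZipFile(archive_path) as zf:
        result = {name: zf.read(name) for name in zf.namelist()}
    leftovers = os.listdir(workdir)
print(sorted(result), leftovers)
```

Creating the temporary file in the same directory keeps the final rename on one filesystem.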
Darius
| |
| cr88192 2006-09-27, 3:55 am |
|
"Carsten Neubauer" <cn@c14sw.de> wrote in message
news:1159273139.758115.107170@m73g2000cwd.googlegroups.com...
> cr88192 schrieb:
>
> Might be true. The design of zipfiles makes handling complicated.
> The best workaround I found to deal with a zipfile is to split it into
> two temporary files (one for the file-data and one for the central
> directory), do adding, replacing or deleting, combine the two files
> again and when the new zipfile is complete, I replace the original
> archive with the new file.
>
> This is quite ugly in terms of efficiency and read/write accesses,
> but minimizes the possibility to have some crash leaving a
> damaged zipfile.
>
yeah.
actually, thinking of it, I could probably rig up a piece of code to just
test how long it takes to scan a file of a given size. if I do IO in
buffers, it should be possible to do it faster.
if it is fast enough, I could just be like, "ok, good enough".
most archives are how much? maybe a few 10s or 100s of MB, should be able to
grind through this in a few seconds.
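A quick timing test along these lines could look like the Python sketch below, which reads in fixed-size chunks and counts local-file-header signatures, carrying a three-byte overlap so a signature split across two chunks is still seen. The function name and the 1 MiB default chunk size are arbitrary choices for the example, and no timing result is claimed here.

```python
import io
import time

LFH_SIG = b"PK\x03\x04"

def count_signatures(stream, chunk_size=1 << 20):
    """Count local-file-header signatures, reading the stream in chunks."""
    count = 0
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return count
        window = tail + chunk
        count += window.count(LFH_SIG)
        tail = window[-(len(LFH_SIG) - 1):]   # keep 3 bytes for split signatures

data = b"xx" + LFH_SIG + b"y" * 10 + LFH_SIG + b"zz" + LFH_SIG
start = time.perf_counter()
hits = count_signatures(io.BytesIO(data))
elapsed = time.perf_counter() - start
print(hits, elapsed)
```

Passing a real file opened in binary mode instead of the `BytesIO` object gives a rough lower bound on how long a recovery scan of that archive would take.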
zip does have the advantage that it is common/a de-facto standard, and is
probably good enough.
after designing it some, the "alternative" starts looking a little more like
an actual filesystem, with the implied complexity...
I am probably not going to be dealing with "that" many files (millions or
more) anyways.
otherwise, I have some 3D stuff I could be working on, or even working on my
language VM (or, maybe do homework...), ...
so yeah...
>
> Just my point of view,
>
>
> Carsten Neubauer
> http://www.c14sw.de/
>
| |
| cr88192 2006-09-27, 3:55 am |
|
"Darius Blaszijk" <dhkblaszyk@zeelandnet.nl> wrote in message
news:wtOdnVM35827BYTYnZ2dnUVZ8sidnZ2d@zeelandnet.nl...
>
> What unzip version did you use? I added two lfh and filedata between CD and
> end of CD header and it worked. Sure unzip complained about having too much
> bytes between the two headers, but it unzipped just fine. Windows didn't
> even notice it like you said.
>
> I tried using unzip 5.42.
>
I am using 5.50.
yes, it gives warnings, but it works.
> I like your ideas, but if I may ask you what does adding data between the
> CD and end of CD header bring you?
the ability to more efficiently, and more easily, manage free space...
I am using a best-fit allocator, which is not so great for dealing with
arbitrary constraints.
> You still need to add the end of CD header after all data is written. So
> there is still a potential zipfile corruption during a crash.
yes, this is true, however since the EOCD marker is small, I don't have to
worry about it as much (even if done poorly, what does it cost? maybe 22
bytes, no big deal). now, with the whole central directory, this is a big
deal.
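The 22-byte figure is the size of the end-of-central-directory record when the archive comment is empty. A small Python sketch that builds one, assuming a single-disk, non-zip64 archive (`build_eocd` is a made-up helper name):

```python
import struct

def build_eocd(entry_count, cd_size, cd_offset, comment=b""):
    """Build an end-of-central-directory record for a single-disk archive."""
    return struct.pack(
        "<4sHHHHIIH",
        b"PK\x05\x06",   # signature
        0,               # number of this disk
        0,               # disk where the central directory starts
        entry_count,     # entries on this disk
        entry_count,     # entries in total
        cd_size,         # size of the central directory in bytes
        cd_offset,       # offset of the central directory
        len(comment),    # comment length
    ) + comment

record = build_eocd(3, 150, 4096)
print(len(record))  # 22
```

Rewriting this record on every commit is cheap, whereas the central directory grows with the number of entries.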
> I'm thinking of streaming all data and adding to it to temporary file and
> then rename the file to the original. This way if something crashes you
> still have the original file.
>
except that this is problematic in the case of a real-time format, likewise,
the file may be big enough that the io overhead of copying a bunch of data,
maybe a good number of times per second, is unrealistic.
if at all possible, I wanted fine-grained transactions (before, this was
linked to whenever a file was closed or flushed, a few pieces of data may be
written, but little else). but the zip format seems to always want to force
me to do coarse-grained transactions (say, writing the central directory
when the archive is closed/unmounted, or maybe "once in a great while").
then again, this may be ok, under the premise I can have code to regenerate
the CD as a backup measure...
> Darius
>
>
| |
| Darius Blaszijk 2006-09-27, 7:55 am |
| >>>> What do you mean by "putting the directory wherever it would fit"?
>
> I am using 5.50.
>
> yes, it gives warnings, but it works.
So this means that putting the data between CD and end of CD header (EOCDH)
is producing valid zipfiles. Great! You could even think of adding a sort of
"defrag" algorithm that reads all data, repositions the lfh's and CD, and
shrinks the zipfile again by removing the holes. This way you will get no
more complaints from unzip. You only need to do that once in a while: just
keep track of fragmentation and, when it reaches a certain limit, sweep the
file and defragment it.
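Tracking fragmentation against a limit could be as simple as the following Python sketch. The span list, the function names and the 0.5 default threshold are assumptions for illustration; a threshold of 0.5 corresponds to the roughly 2x inflation reported earlier in the thread.

```python
def fragmentation(file_size, used_spans):
    """Fraction of the file not covered by live data.

    used_spans is a list of (offset, length) tuples for entries, CD and EOCD.
    """
    if file_size == 0:
        return 0.0
    used = sum(length for _offset, length in used_spans)
    return 1.0 - used / file_size

def needs_defrag(file_size, used_spans, limit=0.5):
    """True once the wasted fraction reaches the chosen limit."""
    return fragmentation(file_size, used_spans) >= limit

spans = [(0, 300), (600, 200)]
print(fragmentation(1000, spans), needs_defrag(1000, spans))
```

The sweep itself would then rewrite entries back to back and place the CD and EOCD at the end.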
>
> the ability to more efficiently, and more easily, manage free space...
> I am using a best-fit allocator, which is not so great for dealing with
> arbitrary constraints.
>
>
>
> yes, this is true, however since the EOCD marker is small, I don't have to
> worry about it as much (even if done poorly, what does it cost? maybe 22
> bytes, no big deal). now, with the whole central directory, this is a big
> deal.
But still you are under the risk of complete file corruption (unless you
recover by reading the lfh's) because the EOCDH is the most important
structure in the zipfile.
Is there such a big difference between writing 22 bytes or 64 KB to the end
of the file? In my lib I have both the CD and EOCDH in memory. Why not write
all of them after each file?
>
> except that this is problematic in the case of a real-time format,
> likewise, the file may be big enough that the io overhead of copying a
> bunch of data, maybe a good number of times per second, is unrealistic.
That is true yes. I never thought of writing and reading "continuously" to
the zipfile. In my mind I would read only once and write everything back
when the app closes. But accessing files real-time would become verrrrrry
slow in this case. Especially large zipfiles with lots of smaller files in
it.
>
> if at all possible, I wanted fine-grained transactions (before, this was
> linked to whenever a file was closed or flushed, a few pieces of data may
> be written, but little else). but the zip format seems to always want to
> force me to do coarse-grained transactions (say, writing the central
> directory when the archive is closed/unmounted, or maybe "once in a great
> while").
>
> then again, this may be ok, under the premise I can have code to
> regenerate the CD as a backup measure...
I think a good lib needs this anyway. Is this also how PKZipFix works?
Darius
|
|
|
|
|