Author misc: zip format, approach to read/write access...
cr88192

2006-04-11, 9:55 pm

I have sat around and wondered about this:
is there any good way to handle read/write access to zip files (say, if the
intent is to use the archive as a kind of filesystem, occasionally replacing
contents with updated versions of the files?...).

looking over the format, it does not look particularly well suited to this,
but from the file structure it is not entirely ruled out. since various
parts of the file have markers, conceivably one could skip over any
unrecognized contents by searching for the markers. slightly better imo
would be some form of padding marker (indicating the relative offset of the
next valid header or such), but afaict zip does not define such a feature
(nor many others that could be helpful).
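
A rough sketch of the marker-scanning idea, in Python for brevity. It searches a buffer for the local file header signature `PK\x03\x04` used by the zip format. The names `find_local_headers` and `make_sample_zip` are made up for this illustration. The same four bytes can also turn up by chance inside compressed data, so a real scanner would have to validate each hit rather than trust it.

```python
import io
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"  # zip local file header signature


def find_local_headers(data):
    """Return the offsets of every local-file-header signature in data."""
    offsets = []
    pos = data.find(LOCAL_HEADER_SIG)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(LOCAL_HEADER_SIG, pos + 1)
    return offsets


def make_sample_zip():
    """Build a tiny two-member archive in memory (stored, not compressed)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("a.txt", b"hello")
        zf.writestr("b.txt", b"world")
    return buf.getvalue()


if __name__ == "__main__":
    print(find_local_headers(make_sample_zip()))
```

A brute-force search like this touches every byte between headers, which is the cost a padding marker with a relative offset would avoid.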

a quick test suggests that the zip utils are able to handle some amount of
garbage thrown in. this implies a "hybrid" between zip and some other format
could be possible, where the other format maintains enough context to allow
reorganizing the contents of the file, managing free space, ..., and the zip
tools just see some amount of garbage (presumably with any zip markers or
other potentially confusing data wiped out).

or is there some better and more common approach?...


quick check seems to imply that the existing tools seem to just rewrite the
file when doing updates, which is unlikely to be entirely acceptable in my
case. this could be workable in some cases (modification of fairly small
file sets), and would only require minor modifications to my existing (read
only) filesystem code (when needed it would rewrite the file), and likely
adding a special mount mode ("rrw", read-rewrite).


or, with all this is it a better idea just to do a custom format
outright?...

this had been considered, the format imagined (this time) was closer to a
funky hybrid of zip and the id pak format (I called it zpak, implying a
compressed variant of pak).

note, it would only rewrite whole files at a time (likely controlled via
flush, closing the file, or when more memory needs to be freed) and would
not support file fragmentation, rather simplifying the implementation, if
limiting contained filesizes, but this is ok for my uses...

then again, this is not zip, and anyone who wanted to get at the contents
would need to use a tool (probably not that big of a deal, but I don't
know).


any comments?...


Nishu

2006-04-12, 3:55 am


cr88192 wrote:
> I have sat around and wondered about this:
> is there any good way to handle read/write access to zip files (say, if the
> intent is to use the archive as a kind of filesystem, occasionally replacing
> contents with updated versions of the files?...).
>


Updating the zipped archive of a single file without
rewriting (re-encoding) may not be a good idea, considering
that the probabilities of occurrence of each char may vary with the update.

Using an archive as a kind of file system (with lots of files) and then
updating it with a new file is not new. Archiving probably works in
this manner anyway; e.g., email systems like Lotus Notes do incremental
archiving of mail. I am not sure whether those are compressed formats, though!

> looking over the format, it does not look particularly well suited to this,
> but from the file structure it is not entirely ruled out. since various
> parts of the file have markers, conceivably one could skip over any
> unrecognized contents by searching for the markers. slightly better imo
> would be some form of padding marker (indicating the relative offset of the
> next valid header or such), but afaict zip does not define such a feature
> (nor many others that could be helpful).
>


What made you think that it's not happening? Running zip over a range of
files that includes already-zipped files will simply make it skip those
(at least it does for files generated with the same encoder, e.g.
WinZip) and go on to the next unzipped file. It's just a kind of archiving,
done in a user-defined manner.

> a quick test suggests that the zip utils are able to handle some amount of
> garbage thrown in. this implies a "hybrid" between zip and some other format
> could be possible, where the other format maintains enough context to allow
> reorganizing the contents of the file, managing free space, ..., and the zip
> tools just see some amount of garbage (presumably with any zip markers or
> other potentially confusing data wiped out).
>
> or is there some better and more common approach?...
>
>


It's vague here.

> quick check seems to imply that the existing tools seem to just rewrite the
> file when doing updates, which is unlikely to be entirely acceptable in my
> case. this could be workable in some cases (modification of fairly small
> file sets), and would only require minor modifications to my existing (read
> only) filesystem code (when needed it would rewrite the file), and likely
> adding a special mount mode ("rrw", read-rewrite).
>
>


I think rewriting would be a better approach. I did a simple
experiment: I took two files in txt format and zipped them, then
compared the result with a zip of a single file that was a combination
of both (I'm considering that as the rewrite case). I got the expected,
obvious result of about 20% better compression. Even if I discount the
fact that zip needs to distinguish between the two files, imo the
overhead still can't exceed the 20% gain. Rewriting and then re-encoding
is the obvious choice, though at the expense of more time. Appending
newly zipped data for a file to the already-zipped data of the same file
will add extra overhead during reconstruction.
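
The experiment above can be approximated with a few lines of Python. This is only a sketch: it uses raw `zlib` streams rather than real zip containers, and the helper name `compare_sizes` is made up here. The actual gain depends entirely on how much the two inputs have in common, so the 20% figure should not be expected in general.

```python
import zlib


def compare_sizes(a, b):
    """Return (separate, combined) compressed sizes in bytes.

    separate: a and b compressed as two independent streams
    combined: a + b compressed as one stream
    """
    separate = len(zlib.compress(a)) + len(zlib.compress(b))
    combined = len(zlib.compress(a + b))
    return separate, combined


if __name__ == "__main__":
    text = b"the quick brown fox jumps over the lazy dog\n" * 100
    separate, combined = compare_sizes(text, text)
    print("separate:", separate, "combined:", combined)
```

When the inputs share content, the combined stream can reuse matches across the file boundary and pays the per-stream overhead only once.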

> or, with all this is it a better idea just to do a custom format
> outright?...
>
> this had been considered, the format imagined (this time) was closer to a
> funky hybrid of zip and the id pak format (I called it zpak, implying a
> compressed variant of pak).
>
> note, it would only rewrite whole files at a time (likely controlled via
> flush, closing the file, or when more memory needs to be freed) and would
> not support file fragmentation, rather simplifying the implementation, if
> limiting contained filesizes, but this is ok for my uses...
>
> then again, this is not zip, and anyone who wanted to get at the contents
> would need to use a tool (probably not that big of a deal, but I don't
> know).
>
>
> any comments?...


cr88192

2006-04-12, 3:55 am


"Nishu" <naresh.attri@gmail.com> wrote in message
news:1144814847.217874.20460@z34g2000cwc.googlegroups.com...
>
> cr88192 wrote:
>
> Updating the zipped archive of a single file without
> rewriting (re-encoding) may not be a good idea, considering
> that the probabilities of occurrence of each char may vary with the update.
>

note:
the zipfile will contain a number of files, but will itself be rewritten.

individual parts of the file (eg: individual compressed files) will
typically be recompressed when being written.

the problem then becomes with the structure of the file in general, as
stream-like formats are generally fairly sensitive to organization issues
(vs. graph-structured formats...).

in general though, most of the actual file io will be being done in memory,
with any true recompressing only occurring in some cases (the contained file
is closed, for example, or more memory is needed in the cache for opening
more files).


> Using an archive as a kind of file system (with lots of files) and then
> updating it with a new file is not new. Archiving probably works in
> this manner anyway; e.g., email systems like Lotus Notes do incremental
> archiving of mail. I am not sure whether those are compressed formats, though!
>

yeah, though probably not with zip.

most archivers typically compress or decompress data in a stream-like
manner, rather than doing so randomly.

>
> What made you think that it's not happening? Running zip over a range of
> files that includes already-zipped files will simply make it skip those
> (at least it does for files generated with the same encoder, e.g.
> WinZip) and go on to the next unzipped file. It's just a kind of archiving,
> done in a user-defined manner.
>

I am not sure if you are talking of the same thing.

anyways, if, when scanning through a file, the archiver encounters a
large glob of random data it has to search through to find the next header,
this can't be good for performance. better not to do it this way if it can
be avoided.

likewise, my tests have shown that, eg, infozip will largely complain about
the presence of unexpected garbage...

>
> It's vague here.
>


basically, I was talking about taking the zip file, and throwing a bunch of
filesystem style metadata into the mix (using a rather different structural
approach than the zip format).


instead, in my case, I opted for 2 different paths:

for the small-scale case, and zipfiles, I will use the "read-rewrite mode"
approach, where the zipfile is opened temporarily and the contents are
entirely loaded into the cache, followed by then closing the original file.
when a rewrite occurs, the file is opened in write mode (which effectively
completely replaces it), and then everything in the cache is written back to
the file (more or less sequentially).

still deciding when exactly "rewrite" should be done, but I am currently
thinking "every time a file was written to and then closed" followed by "if
the volume is unmounted and still contains unwritten data".
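
A minimal sketch of the read-rewrite idea, using Python's standard `zipfile` module rather than the custom filesystem code described here. The function names `load_archive` and `rewrite_archive` are hypothetical, and the cache is just a dict of member name to bytes.

```python
import io
import zipfile


def load_archive(src):
    """Read every member of the zip into an in-memory cache (name -> bytes)."""
    with zipfile.ZipFile(src, "r") as zf:
        return {name: zf.read(name) for name in zf.namelist()}


def rewrite_archive(dst, cache):
    """Replace the archive wholesale, writing the cache back sequentially."""
    with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in cache.items():
            zf.writestr(name, data)


if __name__ == "__main__":
    original = io.BytesIO()
    rewrite_archive(original, {"a.txt": b"old", "b.txt": b"keep"})

    cache = load_archive(io.BytesIO(original.getvalue()))
    cache["a.txt"] = b"new contents"   # modify one sub-file in the cache

    updated = io.BytesIO()
    rewrite_archive(updated, cache)    # the whole archive is written again
    print(sorted(load_archive(io.BytesIO(updated.getvalue()))))
```

Every rewrite costs time and memory proportional to the whole archive, not to the one file that changed, which is the inefficiency noted in the next paragraph.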

in a lot of ways though, this mode is not very efficient (vs. the read-only
mode), eg, because everything has to be kept in memory, and a lot of things
were hacked over.


for possible eventual larger cases, something more along the lines of a
compressed filesystem will be created (structurally similar to a hybrid of
zip and pak, but it will behave more like a filesystem than an archive
format).


>
> I think rewriting would be a better approach. I did a simple
> experiment: I took two files in txt format and zipped them, then
> compared the result with a zip of a single file that was a combination
> of both (I'm considering that as the rewrite case). I got the expected,
> obvious result of about 20% better compression. Even if I discount the
> fact that zip needs to distinguish between the two files, imo the
> overhead still can't exceed the 20% gain. Rewriting and then re-encoding
> is the obvious choice, though at the expense of more time. Appending
> newly zipped data for a file to the already-zipped data of the same file
> will add extra overhead during reconstruction.
>


not file concatenation yo...

for example, I take a zip file, and tell infozip to delete 2 entries, what
do I get:
a file that is smaller than the original, with no holes or similar where the
original entries used to be (implying that the file is rewritten). likewise,
similar seems to happen with replace.

if a rewrite were not occuring the file would not shrink, and there would be
holes where the original entries used to be (possibly with some distinctive
way of marking where the holes are at).

now, for any filesystem with a non-trivial number of files, this wouldn't be
good (as a rather contrived example, what if deleting 2 files from your hd
required a several hour wait while the os shifted all of your disk contents
over by a few hundred bytes?...).


that is why most filesystems contain things like the fat, spans tables,
blocks bitmaps, ...

so, zip is not good for rewrite. better to come up with something at least
marginally better (borrowing ideas from such amazing systems as zip, pak,
and ntfs...).

just, I am not storing any spans tables or similar, as my volumes would be
small enough that I can probably get away with rebuilding a lot of this
stuff at mount time (this also makes consistency checking a little easier,
as there are fewer structures I have to verify are sane).
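
As an illustration of rebuilding free-space information at mount time, here is a sketch that derives the free spans from nothing but the directory entries. It assumes each entry is an `(offset, csize)` pair and that `data_start` is the first byte after the header; the function name `free_spans` is invented for this example, and a real volume would also have to count the directory itself as an occupied span.

```python
def free_spans(entries, data_start, volume_size):
    """Return a list of (start, length) gaps not covered by any entry.

    entries:     iterable of (offset, csize) pairs taken from the directory
    data_start:  first usable byte after the header (an assumption here)
    volume_size: total size of the volume in bytes
    """
    spans = []
    cursor = data_start
    for offset, csize in sorted(entries):
        if offset > cursor:
            spans.append((cursor, offset - cursor))
        # max() keeps the cursor sane even if two entries overlap
        cursor = max(cursor, offset + csize)
    if volume_size > cursor:
        spans.append((cursor, volume_size - cursor))
    return spans


if __name__ == "__main__":
    print(free_spans([(16, 10), (40, 20)], data_start=16, volume_size=100))
```

Since the spans are recomputed on every mount, there is no on-disk free list that could disagree with the directory.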

also, since it is byte (rather than block) oriented, no need to worry about
end packing or wasted space on the end of blocks, and since there is no
fragmentation, most data retrieval tasks should be straightforward.
conceivably fragmentation could help for larger files, though, eg, by
having a certain amount of uncompressed data (eg: 64 or 256kB) compressed
into a variable-sized fragment, which is then placed wherever a suitable
space can be located. then again, this would add a lot of complexity,
and larger files are likely to be rare in my case.

occasionally though it would make sense to repack the files, eg: making sure
all the central directory entries are sorted correctly, and placing all the
compressed files end-to-end in the volume, all followed by the central
directory (vs. putting it wherever there is some free space).

repacking could be made an automated feature though (say, when the volume
exceeds 50% empty space, it makes sense to repack).
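
A sketch of the automatic repack decision and the resulting layout. It assumes a directory held as a dict of `name -> (offset, csize)`; `should_repack` and `repack_plan` are hypothetical names, the 50% threshold is the figure mentioned above, and the space taken by the directory is ignored for simplicity.

```python
def should_repack(entries, data_start, volume_size, threshold=0.5):
    """True when more than `threshold` of the data area is empty space."""
    used = sum(csize for _offset, csize in entries.values())
    data_area = volume_size - data_start
    if data_area <= 0:
        return False
    return (data_area - used) / data_area > threshold


def repack_plan(entries, data_start):
    """Return ({name: new_offset}, directory_offset).

    Files are placed end-to-end in their current on-disk order, so every
    file only ever moves toward the start; the directory goes after them.
    """
    new_offsets = {}
    cursor = data_start
    for name, (_offset, csize) in sorted(entries.items(), key=lambda kv: kv[1][0]):
        new_offsets[name] = cursor
        cursor += csize
    return new_offsets, cursor


if __name__ == "__main__":
    entries = {"a": (16, 10), "b": (80, 20)}
    if should_repack(entries, data_start=16, volume_size=100):
        print(repack_plan(entries, data_start=16))
```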

or something...

>



Jasen Betts

2006-04-12, 9:55 pm

On 2006-04-12, cr88192 <cr88192@NOSPAM.hotmail.com> wrote:

> I have sat around and wondered about this:
> is there any good way to handle read/write access to zip files (say, if the
> intent is to use the archive as a kind of filesystem, occasionally replacing
> contents with updated versions of the files?...).
>
> looking over the format, it does not look particularly well suited to this,
> but from the file structure it is not entirely ruled out. since various
> parts of the file have markers, conceivably one could skip over any
> unrecognized contents by searching for the markers. slightly better imo
> would be some form of padding marker (indicating the relative offset of the
> next valid header or such), but afaict zip does not define such a feature
> (nor many others that could be helpful).
>
> a quick test suggests that the zip utils are able to handle some amount of
> garbage thrown in. this implies a "hybrid" between zip and some other format
> could be possible, where the other format maintains enough context to allow
> reorganizing the contents of the file, managing free space, ..., and the zip
> tools just see some amount of garbage (presumably with any zip markers or
> other potentially confusing data wiped out).
>
> or is there some better and more common approach?...


zlibc?

> quick check seems to imply that the existing tools seem to just rewrite the
> file when doing updates, which is unlikely to be entirely acceptable in my
> case. this could be workable in some cases (modification of fairly small
> file sets), and would only require minor modifications to my existing (read
> only) filesystem code (when needed it would rewrite the file), and likely
> adding a special mount mode ("rrw", read-rewrite).


writing is a problem because the file changes length...

> or, with all this is it a better idea just to do a custom format
> outright?...
>
> this had been considered, the format imagined (this time) was closer to a
> funky hybrid of zip and the id pak format (I called it zpak, implying a
> compressed variant of pak).
>
> note, it would only rewrite whole files at a time (likely controlled via
> flush, closing the file, or when more memory needs to be freed) and would
> not support file fragmentation, rather simplifying the implementation, if
> limiting contained filesizes, but this is ok for my uses...


this would be better than zip format how exactly.
if you have all the data in ram all formats are pretty much equivalent.

Bye.
Jasen


Jim Leonard

2006-04-12, 9:55 pm

cr88192 wrote:
> I have sat around and wondered about this:
> is there any good way to handle read/write access to zip files (say, if the
> intent is to use the archive as a kind of filesystem, occasionally replacing
> contents with updated versions of the files?...).


Not sure about ZIP files, but .tar.gz is already a mountable
"filesystem" under modern versions of Linux. Meaning, you can "mount"
a .tar.gz file and access it as if it were /usr/local/data, etc.
Haven't tried this personally, so I don't know if you can write to it
as well, but I'm sure you can read from it.

cr88192

2006-04-12, 9:55 pm


"Jasen Betts" <jasen@free.net.nz> wrote in message
news:67c5.443cd28d.216ed@clunker.homenet...
> On 2006-04-12, cr88192 <cr88192@NOSPAM.hotmail.com> wrote:
>

<snip>
>
> zlibc?
>

zlib does compression/decompression, but afaik does not deal much with the
zip format (or other archive formats).

>
> writing is a problem because the file changes length...
>

yes, and that is why it needs to be possible to move things around as
needed, eg, with a graph-structured rather than a stream-structured format.

>
> this would be better than zip format how exactly.
> if you have all the data in ram all formats are pretty much equivalent.
>

no, my approaches only keep a minor amount of the data in ram (that which is
cached, in the normal case this limit is about 16MB at present, otherwise,
an algo is used to decide what to uncache), stuff would be pretty regularly
pulled in from disk, and written out to disk, a single sub-file at a time,
without needing to rewrite the whole archive.

using a zip rewriting approach, it is necessary to keep everything in ram,
and rewrite the file all the time. my thought was that, escaping this
requires a different format (for example, a test archive involving about
100MB of data takes a while to resave, limiting the utility to much smaller
archives).

as a result:
zip is unsuitable imo for read-write filesystem type uses.


for example, it might be conceivable that I would want to keep several GB of
data in such a format, loading into ram as such is infeasible.

for example, id pak, though seemingly not well suited to read/write access,
can be adapted to this use easily enough by tools. something changes size,
it is moved elsewhere in the file, and the space where it was before is now
available.

a lot of this is because of the file structure:
header -> points to directory;
directory -> points to each file.

no requirement where things are, the directory can be before, after, or
in-between files, and the files have no particular order, and there are no
requirements about garbage between files.
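
The move-on-resize behaviour described above can be modelled in memory with a directory plus a list of holes. This is only a toy sketch: `PakImage` is an invented class, the allocation policy is a plain first fit, and nothing here reflects the actual id pak tools.

```python
class PakImage:
    """Toy model: a directory points at files that may live anywhere."""

    def __init__(self):
        self.data = bytearray()
        self.directory = {}  # name -> (offset, size)
        self.free = []       # list of (offset, size) holes

    def _alloc(self, size):
        """First fit: reuse a hole if one is big enough, else grow the image."""
        for i, (off, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (off + size, length - size)
                return off
        off = len(self.data)
        self.data.extend(bytes(size))
        return off

    def write(self, name, payload):
        old = self.directory.get(name)
        if old is not None and old[1] == len(payload):
            off = old[0]                 # same size: overwrite in place
        else:
            if old is not None:
                self.free.append(old)    # the old slot becomes available
            off = self._alloc(len(payload))
        self.data[off:off + len(payload)] = payload
        self.directory[name] = (off, len(payload))

    def read(self, name):
        off, size = self.directory[name]
        return bytes(self.data[off:off + size])
```

Only the directory entry and the moved file are touched on an update; the other files stay where they are, which is what a stream-structured format cannot offer.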

this is a much more flexible format to base from in my case.

> Bye.
> Jasen



cr88192

2006-04-12, 9:55 pm


"Jim Leonard" <MobyGamer@gmail.com> wrote in message
news:1144857226.705389.137480@t31g2000cwb.googlegroups.com...
> cr88192 wrote:
>
> Not sure about ZIP files, but .tar.gz is already a mountable
> "filesystem" under modern versions of Linux. Meaning, you can "mount"
> a .tar.gz file and access it as if it were /usr/local/data, etc.
> Haven't tried this personally, so I don't know if you can write to it
> as well, but I'm sure you can read from it.
>

in my case I doubt it.

dunno, imo tar.gz seems not that good as a fs either, as now one has the
problem of needing to decompress the whole thing (in memory at least) before
they can effectively get at the contents (which is actually worse than zip
in my case, and part of why I didn't really consider tar-gz for this).

as a side note: back in my early os-dev days, I once was using tar files
(uncompressed) as the filesystem for my disks, until I eventually got around
to implementing fat and making that the default filesystem...

me thinking: hmm, zip would make sense as a mountable fs in linux (albeit,
for the read-only case).


ok, these aren't linux filesystems, but app-local filesystems (common in my
case, though until recently I had largely been lazy and had just been
mounting directories as volumes).

zip is good enough for the read-only case, which is maybe where it will be
used (zip is at least fairly standard).

zip rewriting is good enough if the data set is fairly small (say, < 10MB).
tests show it is not good for a 100MB test set (as, first off, 100MB of stuff
sits around taking up memory, second of all, rewriting said archive is not
exactly instantaneous...). larger sets (say, 1GB worth of files) would be
even less practical (better at this point just to use the os's
filesystem...).

traditionally, when apps have needed full-on read-write ability, they fall
back to the os filesystem. ok, but I can imagine uses where I might not
want to do it this way:
consider I am doing a modeler, and pieces of the model, skins, sub-meshes,
assembly info, ... are each stored as a file (rather than use a
big-complicated format, say, I decide to represent most pieces as simple
tabular text files). the modeler will sometimes read the files, and
sometimes write to them.
one can stick the model in its own directory, but it is often more
convenient to pack everything together, so to the host fs it all looks
like a single file (which the user can copy around as needed).


alternatively, I need some features not really offered by conventional
filesystems either (eg: fat), making them poorly suited as well:
fat can't really do variable sized volumes, for example, nor does it support
compression, so it is not that useful in my case.
ntfs is just too complicated, and there is no compelling reason to use it
(not like windows can mount arbitrary ntfs volumes as files...).

....


not much evidence exists of similar formats that I can see, so may as well
do my own.

that, or I could decide "good enough", and stick to what I have (os
directories and zip rewriting...).


NOBODY

2006-04-12, 9:55 pm

"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
news:207dd$443c5e67$ca83b3d3$12725@saipan.com:

> if the intent is to use the archive as a kind of filesystem,
> occasionally replacing contents with updated versions of the
> files?...).


[....]

> then again, this is not zip, and anyone who wanted to get at the
> contents would need to use a tool (probably not that big of a deal,
> but I don't know).



Well you have to know. Specs are like water: easier to walk on when
frozen. You seem to want an incremental archive like ntbackup, good old
stacker or doublespace, or any incremental archive tool.
If you have to stick with zip, the format has room for a 'patch' mode,
although they want you to have a license... I cannot tell if most zipfile
readers out there would understand that patch mode either.

See the full document at

http://www.pkware.com/business_and_...pups/appnote.txt







        Value     Size     Description
        -----     ----     -----------
(Patch) 0x000f    2 bytes  Tag for this "extra" block type
        TSize     2 bytes  Size of the total "extra" block
        Version   2 bytes  Version of the descriptor
        Flags     4 bytes  Actions and reactions (see below)
        OldSize   4 bytes  Size of the file about to be patched
        OldCRC    4 bytes  32-bit CRC of the file to be patched
        NewSize   4 bytes  Size of the resulting file
        NewCRC    4 bytes  32-bit CRC of the resulting file

Actions and reactions

        Bits      Description
        ----      ----------------
        0         Use for auto detection
        1         Treat as a self-patch
        2-3       RESERVED
        4-5       Action (see below)
        6-7       RESERVED
        8-9       Reaction (see below) to absent file
        10-11     Reaction (see below) to newer file
        12-13     Reaction (see below) to unknown file
        14-15     RESERVED
        16-31     RESERVED

Actions

        Action    Value
        ------    -----
        none      0
        add       1
        delete    2
        patch     3

Reactions

        Reaction  Value
        --------  -----
        ask       0
        skip      1
        ignore    2
        fail      3
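
For readers who want to see how the Flags field in the tables above breaks down, here is a small decoding sketch. It assumes bit 0 is the least significant bit of the 4-byte value; the function name `decode_patch_flags` and the dictionary keys are invented for this illustration. It only unpacks the bit fields and does not implement patching.

```python
ACTIONS = ("none", "add", "delete", "patch")
REACTIONS = ("ask", "skip", "ignore", "fail")


def decode_patch_flags(flags):
    """Split the 32-bit Flags value into the fields listed in the tables."""
    return {
        "auto_detect": bool(flags & 0x1),             # bit 0
        "self_patch": bool(flags & 0x2),              # bit 1
        "action": ACTIONS[(flags >> 4) & 0x3],        # bits 4-5
        "on_absent": REACTIONS[(flags >> 8) & 0x3],   # bits 8-9
        "on_newer": REACTIONS[(flags >> 10) & 0x3],   # bits 10-11
        "on_unknown": REACTIONS[(flags >> 12) & 0x3], # bits 12-13
    }


if __name__ == "__main__":
    # auto detect, action=patch, skip if absent, fail if unknown
    print(decode_patch_flags(0x1 | (3 << 4) | (1 << 8) | (3 << 12)))
```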

Patch support is provided by PKPatchMaker(tm) technology and is
covered under U.S. Patents and Patents Pending. The use or
implementation in a product of certain technological aspects set
forth in the current APPNOTE, including those with regard to
strong encryption or patching, requires a license from PKWARE.
Please contact PKWARE with regard to acquiring a license.


cr88192

2006-04-12, 9:55 pm


"NOBODY" <antispam@0.0.0.0> wrote in message
news:Xns97A3C652D4CB0nobodyantispam@207.35.177.135...
> "cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
> news:207dd$443c5e67$ca83b3d3$12725@saipan.com:
>
>
> [....]
>
>
>
> Well you have to know. Specs are like water: easier to walk on when
> frozen. You seem to want an incremental archive like ntbackup, good old
> stacker or doublespace, or any incremental archive tool.
> If you have to stick with zip, the format has room for a 'patch' mode,
> although they want you to have a license... I cannot tell if most zipfile
> readers out there would understand that patch mode either.
>


yeah, I would rather do my own format than use patch mode...


note that for whatever I do, I will write my own code. I will blow off any
options expecting me to use other peoples' code...


not so much aiming for incremental backup though, so much as full-on
filesystem type usage (stacker/doublespace does this, dunno much about the
others).

by "occasionally" I meant, "every few seconds" to "whenever the sub-file is
closed", in contrast to "every time an io operation occurs". eg: it would
be unrealistic to recompress after every call to 'fwrite' or 'fputc';
instead, I do it on 'fflush' or 'fclose' style operations (this way, updates
are occasional, and writing is likely to be "in general, fast enough"...).
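
A sketch of the recompress-on-flush idea, in Python rather than the C stdio calls named above. `CachedSubFile` is an invented class, the archive is stood in for by a plain dict of name to compressed bytes, and `compress_calls` exists only to make the behaviour visible.

```python
import zlib


class CachedSubFile:
    """Buffers writes in memory; recompresses only on flush() or close()."""

    def __init__(self, store, name):
        self.store = store  # stand-in for the archive: name -> compressed bytes
        self.name = name
        if name in store:
            self.buffer = bytearray(zlib.decompress(store[name]))
        else:
            self.buffer = bytearray()
        self.dirty = False
        self.compress_calls = 0

    def write(self, data):
        # cheap: no compression happens per write
        self.buffer += data
        self.dirty = True

    def flush(self):
        if self.dirty:
            self.store[self.name] = zlib.compress(bytes(self.buffer))
            self.compress_calls += 1
            self.dirty = False

    def close(self):
        self.flush()


if __name__ == "__main__":
    store = {}
    f = CachedSubFile(store, "log.txt")
    for _ in range(1000):
        f.write(b"x")
    f.close()
    print(f.compress_calls, len(store["log.txt"]))
```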

my deflater can be made fast enough that the costs of recompressing some
data every so often are not that high (just, as I noted elsewhere, it is
expensive when done in large bursts, eg, 100MB, where during that whole time
the app could become unresponsive and the user is annoyed as they wait for
100MB of data to suddenly recompress...).


performance doesn't need to be super high though, but something not
ludicrously slow would be good.

> See the full document at
>
> http://www.pkware.com/business_and_...pups/appnote.txt
>
>

yeah, I have read a lot of this.


did design my own format for if I need it.


with some thought I ended up making it use 64 bit offsets (conceivably, at
some point, I may want more than 4GB of data in the image, but sub-files
have 32 bit lengths as presumably nothing large will be stored).

also spec'ed out how fragmenting would go if I needed it (would use a flag
to indicate that a file is fragmented, and the contents refer to a list of
fragments rather than the file itself...). however, actually supporting
fragmenting in said fs code would require a more involved approach
(read/write functions would need to take into account that they can cross
fragment borders, ... and caching would likely need to be per-fragment vs.
per-file).
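
To show why read functions get more involved once fragments exist, here is a sketch of a read that crosses fragment borders. It assumes the fragments are already available as uncompressed byte strings in file order; `read_fragmented` is a made-up name, and a real implementation would decompress and cache each fragment on demand instead.

```python
def read_fragmented(fragments, offset, length):
    """Read `length` bytes starting at logical `offset` across fragments."""
    out = bytearray()
    frag_start = 0  # logical offset of the current fragment's first byte
    for frag in fragments:
        frag_end = frag_start + len(frag)
        if length > 0 and offset < frag_end:
            local = offset - frag_start      # position inside this fragment
            chunk = frag[local:local + length]
            out += chunk
            offset += len(chunk)             # continue in the next fragment
            length -= len(chunk)
        frag_start = frag_end
    return bytes(out)


if __name__ == "__main__":
    frags = [b"abcd", b"efgh", b"ij"]
    print(read_fragmented(frags, 2, 5))
```

A read that runs past the end of the last fragment simply returns fewer bytes than requested.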


the spec is not finished yet though, so is still subject to change.

for general info, here are the 2 main structures of the format:
Header
{
    FOURCC magic;     //'ZPAK'
    u32 ents;         //size of directory
    u64 offs;         //offset of directory
}

DirEntry
{
    char name[32];    //file name, 0 padded
    u32 chain;        //next dir entry
    u32 date_time;    //date modified
    u16 flags;        //file flags (1&=fragmented, 2&=dir)
    u16 method;       //method number, 0=store, 8=deflate, 9=deflate(ext)
    u32 crc32;        //CRC32 of data (or first dir entry)
    u32 usize;        //uncompressed size of data
    u32 csize;        //compressed size of data
    u64 offset;       //offset to start of data
}
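
The two structures above can be serialized with Python's `struct` module as a quick way to check their sizes. This sketch assumes little-endian byte order and no padding between fields, neither of which the draft spec states; the helper names are invented here.

```python
import struct

# "<" = little-endian, no alignment padding (both are assumptions)
HEADER = struct.Struct("<4sIQ")            # magic, ents, offs
DIR_ENTRY = struct.Struct("<32sIIHHIIIQ")  # name, chain, date_time, flags,
                                           # method, crc32, usize, csize, offset


def pack_header(ents, offs):
    return HEADER.pack(b"ZPAK", ents, offs)


def pack_dir_entry(name, chain, date_time, flags, method,
                   crc32, usize, csize, offset):
    raw = name.encode("ascii")
    if len(raw) > 32:
        raise ValueError("name longer than 32 bytes")
    # struct pads the 32s field with NUL bytes, matching "0 padded"
    return DIR_ENTRY.pack(raw, chain, date_time, flags, method,
                          crc32, usize, csize, offset)


def unpack_dir_entry(blob):
    fields = DIR_ENTRY.unpack(blob)
    name = fields[0].rstrip(b"\0").decode("ascii")
    return (name,) + fields[1:]


if __name__ == "__main__":
    print(HEADER.size, DIR_ENTRY.size)
```

Under these assumptions the header comes to 16 bytes and each directory entry to 64 bytes.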

as can be guessed, I am also not expecting many
"uber-long-file-names-of-excessive-typing-death.txt" either...

<snip>

