How does it compare to the format used by 7zip?
Claudio
"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in message
news:3KA2e.188$HA6.18@fe07.usenetserver.com...
> I beat together an idea for an archive format.
> I have yet to get to writing an archiver for it, and it may change, but I
> figured I would post what I have come up with.
>
> note: in a lot of ways this is likely similar to http, but I have varied it
> in numerous subtle ways, and there is the whole issue that it is not http, so
> I don't need to be bound that much to the spec anyways...
>
> but, anyways, if anyone feels like commenting on the general idea, that would
> be nice...
>
> ---
> Simplistic vaguely HTTP-like archive format.
>
> Considering extension 'HTAR'.
>
> Structure will consist of a number of headers, interspersed with "content".
>
> Each header will take the form of a number of key/value pairs.
>
> Each pair will have the syntax:
> <key> ': ' <value>
>
> 1 space is to be present after the colon, any more will be interpreted as
> part of the value.
>
> eg:
> File-Name: foobar.txt
>
> Each value may be continued over multiple lines by having each subsequent
> line indented by 1 tab. The tab is not included, and no extra characters are
> to be inserted.
>
> File-Name: foo
> 	bar.txt
>
> Each line should be limited to 80 characters.
> Either a single newline or a carriage-return newline pair is allowed as a
> line separator (hmm, probably I guess CRLF is preferred, but either should
> be accepted).
>
> Each header is terminated by a single blank line.
>
> Content will be defined as some amount of data directly following the
> header, with both its presence and size given within the header.
> In particular, 'Content-Length' will indicate the presence and size of the
> content.
>
> Any blank lines within the inter-header space are to be ignored.
>
>
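A minimal sketch of a reader for the header rules above, in Python. The names `read_header` and `read_entry` are made up for illustration and are not part of the proposal. It assumes UTF-8 text, that a missing Content-Length means no content, and that a tab-indented line is always a continuation of the previous field.

```python
import io

def read_header(stream):
    """Read one header from a binary stream; return a dict, or None at end of data."""
    fields = {}
    last_key = None
    while True:
        raw = stream.readline()
        if not raw:                                   # end of stream
            return fields or None
        line = raw.rstrip(b"\r\n").decode("utf-8")    # accept LF or CRLF
        if line == "":
            if fields:                                # blank line terminates the header
                return fields
            continue                                  # blank lines between headers are ignored
        if line.startswith("\t") and last_key is not None:
            fields[last_key] += line[1:]              # continuation: drop the tab, add nothing
            continue
        key, sep, value = line.partition(": ")        # exactly one space after the colon
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        fields[key] = value
        last_key = key

def read_entry(stream):
    """Return (header, content) for the next entry, or None at end of data."""
    header = read_header(stream)
    if header is None:
        return None
    # Base 0 accepts both decimal and C-style 0x hex.
    size = int(header.get("Content-Length", "0"), 0)
    return header, stream.read(size)
```

Because the content is read by byte count rather than by scanning for a delimiter, it can contain blank lines or binary data without confusing the reader.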
> Values
>
> Within values, C style escapes are to be used if needed, eg: \\ \t \n \r ...
> Numbers will be represented either in decimal or hex (C-style 0x
> convention).
> Commas are to be used as the general separator.
>
> Times will have the format:
> YYYY-MM-DD hh:mm:ss [TZ]
>
> 'hh' is a 24 hour clock ranging from 00 to 23.
>
> Where TZ is +hhmm, -hhmm, or some timezone mnemonic (eg: GMT).
> If omitted, TZ should be interpreted as local time.
>
> Example:
> 2005-03-31 02:11:20 +1000
> 2005-03-31 01:11:20 +0900
>
>
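The time format above maps fairly directly onto Python's `datetime.strptime`. This is only a sketch: `parse_time` is a hypothetical helper, and of the timezone mnemonics it handles just GMT and UTC, since the proposal does not list which names are allowed.

```python
from datetime import datetime, timezone

def parse_time(text):
    """Parse 'YYYY-MM-DD hh:mm:ss [TZ]'. A naive result means local time."""
    parts = text.split()
    if len(parts) == 2:                               # TZ omitted: local time
        return datetime.strptime(text.strip(), "%Y-%m-%d %H:%M:%S")
    date, clock, tz = parts
    if tz in ("GMT", "UTC"):                          # assumed set of mnemonics
        stamp = datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S")
        return stamp.replace(tzinfo=timezone.utc)
    # Numeric offset such as +1000 or -0330.
    return datetime.strptime(f"{date} {clock} {tz}", "%Y-%m-%d %H:%M:%S %z")

def parse_times(value):
    """Parse a comma-separated list of times, as used by File-MTime and friends."""
    return [parse_time(item) for item in value.split(",")]
```

Note that the two example times in the post differ only in their offsets and name the same instant, so they compare equal once parsed.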
> General Fields
>
> Header-Type: <typename>
> The type of a particular header:
>   File       A single file;
>   FileGroup  A group of files (packed end to end and encoded together);
>   Directory  A directory.
>
> Any unknown header types should probably be ignored.
>
> File-Name: <filename> (',' <filename> )*
> The name of one or more files.
> File-Size: <size> (',' <size> )*
> Uncompressed size of one or more files.
> File-ATime: <time> (',' <time> )*
> File-MTime: <time> (',' <time> )*
> File-CTime: <time> (',' <time> )*
> Optional: file access, modification, and creation times.
>
> Other OS-specific fields could be included here, eg:
> File-Linux-Type: <mode>
> File-Linux-Mode: <mode>
> File-Linux-Dev-Major: <number>
> File-Linux-Dev-Minor: <number>
> File-Linux-UID: <uid>
> File-Linux-GID: <gid>
> ...
>
> Or even implementation-specific fields:
> File-libfoo-bar: <string>
> File-libfoo-baz: <number>
> ...
>
> Content-Encoding: <algoname>
> Algorithm used for encoding the content.
> Content-Length: <size>
> Size of the header's content in the archive.
> Content-Type: <typename>
> Mime type of content (optional and likely irrelevant).
>
>
>
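To make the field list concrete, here is a rough sketch of the writer side: turning a set of fields into header bytes, folding long values onto tab-indented continuation lines. `format_header` is an invented name, the tab is counted as one character toward the 80-character limit (the post does not say how it counts), and escaping of special characters inside values is left out.

```python
def format_header(fields, width=80):
    """Serialize a dict of fields as one header, terminated by a blank line."""
    lines = []
    for key, value in fields.items():
        text = f"{key}: {value}"
        lines.append(text[:width])                    # first physical line
        rest = text[width:]
        while rest:                                   # fold the remainder
            lines.append("\t" + rest[:width - 1])     # tab + up to width-1 characters
            rest = rest[width - 1:]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")

example = format_header({
    "Header-Type": "File",
    "File-Name": "foobar.txt",
    "File-Size": 12,
    "File-MTime": "2005-03-31 02:11:20 +1000",
    "Content-Length": 12,
})
```

The content bytes would follow `example` directly in the archive.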
"Claudio Grondi" <claudio.grondi@freenet.de> wrote in message
news:3b0ggpF69bvisU1@individual.net...
> How does it compare to the format used by 7zip?
>

well, first off, it is completely different...

7zip uses a binary tv style format apparently consisting of byte prefixes
followed by prefix specific data, and encodes larger numbers with a vli
scheme.

imo, it is almost an exact opposite:
mine mostly text, 7z binary;
mine open tag/value structure, 7z has a fairly fixed structure;
mine should be easy to extend independently, 7z will likely require
centralized activity;
mine has minimal concern for format overhead, 7z has massive concern for
overhead (eg: 7z uses individual bytes and bitpacking often, whereas mine
represents numbers in plaintext...);
...

my main reasoning is primarily that the headers are likely to be smaller
than the files anyways, so a little bloat is probably no big deal.

the file is likely to be read/written in binary mode anyways, so things
like seeking are expected (including, eg, possibly space padding numbers so
it is possible to seek back to them and fill them in later or such).

otherwise, I might not want writer seeking, so chunking may make sense
instead. I guess it would depend on the writer.

may as well keep the footer (defined as another header). dunno, could put
cumulative compressed crc's there or something.

Example of which would be, eg, allowing:

Content-Type: chunked

12
Hello All,
13
The Next Part
Content-Adler32: *

or whatever...

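The chunked idea is only hinted at in the example above, so the following Python sketch fills in details that the post does not specify. It assumes each chunk is a decimal (or 0x hex) size line followed by exactly that many bytes, that blank lines between chunks are ignored, and that a size of 0 ends the content before the footer header. `read_chunks` is an illustrative name.

```python
import io

def read_chunks(stream):
    """Read chunked content up to the terminating 0-length chunk; return the payload."""
    chunks = []
    while True:
        raw = stream.readline()
        if not raw:
            raise EOFError("missing terminating 0-length chunk")
        line = raw.strip()
        if not line:
            continue                                  # tolerate line breaks after a payload
        size = int(line.decode("ascii"), 0)
        if size == 0:
            return b"".join(chunks)                   # footer header follows in the stream
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated chunk")
        chunks.append(payload)
```

A writer using this layout never needs to seek back and patch a length field, which is the trade-off against space-padded numbers mentioned in the post.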
"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
news:gUH2e.218$HA6.71@fe07.usenetserver.com:

> my main reasoning is primarily that the headers are likely to be
> smaller than the files anyways, so a little bloat is probably no big
> deal.

Actually I disagree with your reasoning: you're adding additional bloat to
the code that manages the archive. It is much easier to work with a binary
file, with either a fixed or a dynamic structure, than with a text one,
since you need to parse the headers back into a format usable from your
program.

If you're looking for a chunkable format, take a look at the old Amiga IFF
format or the PNG one as examples.

See ya

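For comparison, the PNG chunk layout mentioned above is a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 computed over the type and data. A short Python sketch follows; `write_chunk` and `read_chunk` are illustrative names, not part of any library. IFF chunks are similar in spirit (4-byte ID, then a big-endian length), but carry no CRC and pad odd-length data to an even boundary.

```python
import io
import struct
import zlib

def write_chunk(ctype, data):
    """Serialize one PNG-style chunk: length, type, data, CRC-32 of type+data."""
    crc = zlib.crc32(ctype + data)
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)

def read_chunk(stream):
    """Read one chunk; return (type, data), or None at end of stream."""
    head = stream.read(8)
    if len(head) < 8:
        return None
    length, ctype = struct.unpack(">I4s", head)
    data = stream.read(length)
    (crc,) = struct.unpack(">I", stream.read(4))
    if crc != zlib.crc32(ctype + data):
        raise ValueError(f"bad CRC in chunk {ctype!r}")
    return ctype, data
```

A reader that does not recognize a chunk type can still skip it, because the length comes first. That is the same property Content-Length gives the text format.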
"Vicente Werner" <Nothin@nothing.com> wrote in message
news:Xns962ACE7A12A11notasinglethingofmy i@216.196.109.144...
> "cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
> news:gUH2e.218$HA6.71@fe07.usenetserver.com:
>
> Actually I disagree with your reasoning, as you're adding additional bloat
> to the code to manage the archive, is much easier to work with a binary
> file with a fixed structure or a dynamic one, than with a text one, since
> you need to parse the headers back into a format usable from your
> program.
>

yes, I know, it is more awkward, but it should be far more extensible
without risking breaking existing tools...

I am now considering adding simplistic compression to reduce the header
bloat, at the cost of a lot more code bloat.

anyways, I was never saying text was a "convenient" way of doing the
archives, only that the structure should be tolerable, and is in most ways
almost exactly the opposite of the 7z format...

a more persuasive argument would have been that the headers were
unreasonably large, which I might have dealt better with, this is partly
why I am considering the compression (at the cost of losing the format
being mostly textual...). this should not significantly hurt extensibility.

values <128: passed through clean.
values >=128: interpreted as lz values
  128 (run 0): interpreted as an escape, 1 byte escaped value
  129 (run 1): reserved, possibly length-prefixed escape
  130 (run 2): reserved
  131..255: sane run values
next 2 bytes are offset

(note: format most likely to do poorly on binary data given mass
escaping...).

I have put a little bit of thought into how to try for a fast
stream-centric decoder. buffer would be circular with 2 pointers (current
read position and current end of decoded data). wrapping is a difficult
issue performance-wise.

encoder, should be able to get a reasonably fast one. for a 64k
dictionary, I am imagining needing approx 320kB of memory (256k of which
would be related to hashing). I put a little thought into figuring how to
do wrap-around with the hash data and efficiently handling the wraparound
issues. slower than doing nothing, but might be acceptable.

note: I am now thinking this approach may be relevant to xml, at the cost
of being slower than a true binary xml.

given I am considering an algo which does not use entropy coding,
encoding/decoding should be faster than with one that does (and faster
than my other decoders, since it will be possible to detect runs with a
single mask and the run structure is fixed).

a plain linear decoder would likely be faster than the ring-based one I am
imagining, but a ring-based one reduces memory-related concerns, and
should be better to handle the mixing of encoded and non-encoded data...

not like I need that much speed in accessing the headers anyways though...

> If you're looking for a chunkable format, take a look at the old amiga IFF
> format or the PNG one as examples.
>

I looked at those, but decided against them on the grounds that they make
extensibility more complicated (yes, IFF and PNG have generally extensible
formats, but imo, there are likely a lot more issues than there would be,
eg, with something closer to http...).

or such...

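The byte scheme described above is simple enough to sketch as a plain linear decoder in Python. Several details are guesses rather than anything stated in the post: the run length is taken to be the byte value minus 128, the 2-byte offset is read as little-endian, and it is treated as a distance back from the end of the decoded output. `lz_decode` is an invented name, and this is the linear variant, not the ring-buffer decoder the post has in mind.

```python
def lz_decode(data):
    """Decode the sketched format: literals < 128, 128 = escape, 131..255 = run."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b < 128:                                   # plain literal
            out.append(b)
        elif b == 128:                                # escape: next byte is taken verbatim
            out.append(data[i])
            i += 1
        elif b in (129, 130):
            raise ValueError(f"reserved code {b}")
        else:
            run = b - 128                             # assumed: run length 3..127
            offset = int.from_bytes(data[i:i + 2], "little")   # assumed byte order
            i += 2
            if offset == 0 or offset > len(out):
                raise ValueError("offset outside decoded data")
            start = len(out) - offset
            for k in range(run):                      # byte-wise so overlapping copies work
                out.append(out[start + k])
    return bytes(out)
```

Detecting a run is a single test of the high bit, which is the "single mask" advantage the post mentions. Every byte of 128 or above costs two bytes when escaped, which is why binary data would do poorly.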
Hy;

You should take a look at a DICOM implementation if you want to know how
to handle proprietary archive information fitting market needs. The amount
of archive information in DICOM is horribly large. Maybe you can detect
the heaviest flaws and find better ways.

Ciao
Niels

"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
news:Qm13e.15$A71.1@fe07.usenetserver.com:

> yes, I know, it is more awkward, but it should be far more extensible
> without risking breaking existing tools...

A proper binary format does the same; take a look at IFF, for example, or
PNG.

> anyways, I was never saying text was a "convenient" way of doing the
> archives, only that the structure should be tolerable, and is in most
> ways almost exactly the opposite of the 7z format...

I don't think it's going to be tolerable; in the end, the chunk of code
needed to deal with just the headers will be HUGE relative to the size of
the headers.

> a more persuasive argument would have been that the headers were
> unreasonably large, which I might have dealt better with, this is
> partly why I am considering the compression (at the cost of losing the
> format being mostly textual...).

The overhead I'm worried about is not that one; it's the one in the code.

> I looked at those, but decided against them on the grounds that they
> make extensibility more complicated (yes, IFF and PNG have generally
> extensible formats, but imo, there are likely a lot more issues than
> there would be, eg, with something closer to http...).

Of course there is always a limit on how much you can expand a format or
how much you can do with it. No matter how you design it, in the end there
is always a point where something needs to be changed, and compatibility
breaks to do it.

For example, I do not think your system will be realistic for delete
operations on very large archives with 100000+ compressed items, and
adding intra-item error recovery records will likely impose heavy
overheads.

"Vicente Werner" <Nothin@nothing.com> wrote in message
news:Xns962BBDB2AC15Enotasinglethingofmy i@216.196.109.144...
> "cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
> news:Qm13e.15$A71.1@fe07.usenetserver.com:
>
> A properly binary format does also the same, take a look at IFF for
> example, or PNG.
>

I know both formats.

they are extensible, but one has to worry a little more about behavior by
tools upon encounter of unknown chunks (png specifies this a little more
than iff does), one also has to worry more about fourcc clash, whereas
with plaintext one can generate much longer names.

> I don't think it's going to be tolerable, at the end the chunk of code
> needed to deal with just the headers will be HUGE regarding the size of
> it.
>

yes, I know...

> The overhead I'm worried about is not that one, it's the one at the code.
>

ok.

I wrote a basic parser/dumper already, and it would not be too hard to
modify it into a decompressor. at present, vars are parsed and stuffed
into locals. mostly I am thinking of having a struct which would hold all
the known parsed vars, and dispatching using the struct (header-type and
whatever) to perform the decode.

> Of course there're always a limit on how much you can expand a format or
> how much you can do with it, no matter how you design it, at the end
> there's always a point where something needs to be changed and break
> compatibility to do it.
>

yes, I know as well, just afaik, with an IFF or PNG style format, this
threshold is likely to be a little lower.

of course, one could use plaintext for the file-info, but then again, same
problem.

I started designing a format like this already, and had realized that
compound entries would be a significant design issue with such a format,
but not so big of a deal with text.

> For example I do not think your system will be realistic to deal with
> delete operations on very large archives, with 100000+ of compressed
> items, or adding error recovery records intra item will likely impose
> heavy overheads.
>

nope, it probably won't...

I aim low up front, but I still hope for a flexible format (eg: one that
can possibly be easily customized for "experimental" uses or whatever), or
allowing patching in 3rd party tools (eg: want bzip2 support, just add an
entry in the config file, and hope the person decompressing did similar).
as a result, likely I am going to be calling external tools for
compression and decompression.

most more generic archive use though consists of just archiving a
directory or unpacking files into a directory.

it is likely to beat out tar though, as it will be possible to read file
lists without decompressing the whole file. it will also be a lot more
extensible than either tar or zip.

or such...

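The claim about reading file lists without decompressing can be illustrated with a short Python sketch: read each header, collect its File-Name values, and seek past the content using Content-Length. `list_names` is a hypothetical helper. It assumes plain (non-chunked) content, uncompressed headers, and file names that contain no commas or escapes.

```python
import io

def list_names(stream):
    """Return every File-Name in the archive, seeking past content unread."""
    names, header, last_key = [], {}, None
    while True:
        raw = stream.readline()
        if not raw:                                   # end of archive
            return names
        line = raw.rstrip(b"\r\n").decode("utf-8")
        if line.startswith("\t") and last_key:
            header[last_key] += line[1:]              # continuation line
        elif line:
            last_key, _, value = line.partition(": ")
            header[last_key] = value
        elif header:                                  # blank line ends a header
            names += [n for n in header.get("File-Name", "").split(",") if n]
            size = int(header.get("Content-Length", "0"), 0)
            stream.seek(size, io.SEEK_CUR)            # skip content without reading it
            header, last_key = {}, None
```

The cost of a listing is one header parse and one seek per entry, regardless of how large or how heavily compressed each entry's content is.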
Hy;

> err, somehow I get the idea this is not a file archiver...

The general solution you are trying to reach with your open text
attributes is not (only) specific to an archiver. In your concept you tag
files, and DICOM graphic files may give you a good idea about that.

Ciao
Niels

"Niels Fröhling" <niels.froehling@seies.de> wrote in message
news:d2mnbo$alt$1@domitilla.aioe.org...
> Hy;
>
> The general solution you try reach with your open text
> attributes is not (only) specific to an archiver.
> In your concept you tag files, and DICOM graphic files
> may give you a good idea about that.
>

oh, ok then.

"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
news:Sgl3e.3062$A71.493@fe07.usenetserver.com:

> they are extensible, but one has to worry a little more about behavior
> by tools upon encounter of unknown chunks (png specifies this a little
> more than iff does), one also has to worry more about fourcc clash,
> whereas with plaintext one can generate much longer names.

The first argument is not a fault of the format; it's a failure of the
applications dealing with it. As for the second, if you use 4 bytes you
have 2^32 possibilities... hard to believe you'll get into clashes.

> yes, I know as well, just afaik, with an IFF or PNG style format, this
> threshold is likely to be a little lower.

You still haven't shown a point backing that argument.

> I started designing a format like this already, and had realized that
> compound entries would be a significant design issue with such a
> format, but not so big of a deal with text.

Why?

> I aim low up front, but I still hope for a flexible format (eg: one
> that can possibly be easily customized for "experimental" uses or
> whatever), or allowing patching in 3rd party tools (eg: want bzip2
> support, just add an entry in the config file, and hope the person
> decompressing did similar).

Adding human intervention will only make your system less usable.

> it is likely to beat out tar though, as it will be possible to read
> file lists without decompressing the whole file.

Please don't compare apples with oranges, they're different! Tar was
designed a long time ago as just an archival format; they didn't even
think of file-by-file compression, so it's not a valid reference or
benchmark. A valid reference would be any of the current file formats
without compression.