Code Comments

Re: "HTAR" archive format idea
How does it compare to the format used by 7zip?

Claudio

"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in message
news:3KA2e.188$HA6.18@fe07.usenetserver.com...
> I beat together an idea for an archive format.
> I have yet to get to writing an archiver for it, and it may change, but I
> figured I would post what I have come up with.
>
> note: in a lot of ways this is likely similar to http, but I have varied it
> in numerous subtle ways, and there is the whole issue that it is not http so
> I don't need to be bound that much to the spec anyways...
>
> but, anyways, if anyone feels like commenting on the general idea that would
> be nice...
>
> ---
> Simplistic vaguely HTTP-like archive format.
>
> Considering extension 'HTAR'.
>
> Structure will consist of a number of headers, interspersed with "content".
>
> Each header will take the form of a number of key/value pairs.
>
> Each pair will have the syntax:
> <key> ': ' <value>
>
> 1 space is to be present after the colon, any more will be interpreted as
> part of the value.
>
> eg:
> File-Name: foobar.txt
>
> Each value may be continued over multiple lines by having each subsequent
> line indented by 1 tab. The tab is not included, and no extra characters are
> to be inserted.
>
> File-Name: foo
>         bar.txt
>
> Each line should be limited to 80 characters.
> Either a single newline or a carriage-return newline pair is allowed as a
> line separator (hmm, probably I guess CRLF is preferred, but either should
> be accepted).
>
> Each header is terminated by a single blank line.
>
> Content will be defined as some amount of data directly following the
> header, with both its presence and size given within the header.
> In particular, 'Content-Length' will indicate the presence and size of the
> content.
>
> Any blank lines within the inter-header space are to be ignored.
>
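
A minimal sketch of how a reader might apply the header rules above. It assumes UTF-8 text, which the post does not specify, and the names `read_header` and `read_entries` are hypothetical, not part of the proposal.

```python
import io

def _number(text):
    # Decimal or C-style 0x hex, as the post allows for numbers.
    text = text.strip()
    return int(text, 16) if text.lower().startswith("0x") else int(text, 10)

def read_header(stream):
    """Read one header from a binary stream; return a dict, or None at EOF."""
    fields = {}
    last_key = None
    while True:
        raw = stream.readline()
        if not raw:                                   # end of archive
            return fields or None
        line = raw.rstrip(b"\r\n").decode("utf-8")    # accept LF or CRLF
        if line == "":
            if fields:                                # blank line ends a header
                return fields
            continue                                  # blank lines between headers are ignored
        if line.startswith("\t") and last_key is not None:
            fields[last_key] += line[1:]              # continuation: drop the tab, add nothing
            continue
        key, sep, value = line.partition(": ")        # exactly one space after the colon
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        fields[key] = value
        last_key = key

def read_entries(stream):
    """Yield (header, content) pairs; content is empty without Content-Length."""
    while (header := read_header(stream)) is not None:
        length = _number(header.get("Content-Length", "0"))
        yield header, stream.read(length)

archive = (b"Header-Type: File\r\n"
           b"File-Name: foo\r\n"
           b"\tbar.txt\r\n"
           b"Content-Length: 5\r\n"
           b"\r\n"
           b"hello"
           b"\r\n"
           b"Header-Type: Directory\n"
           b"File-Name: docs\n"
           b"\n")
entries = list(read_entries(io.BytesIO(archive)))
```

Here the continued `File-Name` value comes back as `foobar.txt`, and the `Directory` header carries no content because it has no `Content-Length`.
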
>
> Values
>
> Within values, C style escapes are to be used if needed, eg: \\ \t \n \r ...
> Numbers will be represented either in decimal or hex (C-style 0x
> convention).
> Commas are to be used as the general separator.
>
> Times will have the format:
> YYYY-MM-DD hh:mm:ss [TZ]
>
> 'hh' is a 24 hour clock ranging from 00 to 23.
>
> Where TZ is +hhmm, -hhmm, or some timezone mnemonic (eg: GMT).
> If omitted, TZ should be interpreted as local time.
>
> Example:
>  2005-03-31 02:11:20 +1000
>  2005-03-31 01:11:20 +0900
>
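
A small sketch of parsing this time format with Python's standard `datetime` module. The helper name `parse_htar_time` is hypothetical, and timezone mnemonics such as GMT are deliberately not handled here; only the numeric `+hhmm` / `-hhmm` form and the omitted-TZ form are covered.

```python
from datetime import datetime

def parse_htar_time(text):
    """Parse 'YYYY-MM-DD hh:mm:ss [+hhmm|-hhmm]'.

    With a numeric offset the result is timezone-aware; without one it is a
    naive datetime, which the caller should treat as local time.
    """
    text = text.strip()
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S %z")
    except ValueError:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

a = parse_htar_time("2005-03-31 02:11:20 +1000")
b = parse_htar_time("2005-03-31 01:11:20 +0900")
same_instant = (a == b)   # aware datetimes compare by absolute time
```

The two example timestamps from the post differ by one hour of wall-clock time and one hour of offset, so they denote the same instant.
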
>
> General Fields
>
> Header-Type: <typename>
>  The type of a particular header:
>  File       A single file;
>  FileGroup  A group of files (packed end to end and encoded together);
>  Directory  A directory.
>
>  Any unknown header types should probably be ignored.
>
> File-Name: <filename> (',' <filename> )*
>  The name of one or more files.
> File-Size: <size> (',' <size> )*
>  Uncompressed size of one or more files.
> File-ATime: <time> (',' <time> )*
> File-MTime: <time> (',' <time> )*
> File-CTime: <time> (',' <time> )*
>  Optional: file access, modification, and creation times.
>
> Other OS-specific fields could be included here, eg:
> File-Linux-Type: <mode>
> File-Linux-Mode: <mode>
> File-Linux-Dev-Major: <number>
> File-Linux-Dev-Minor: <number>
> File-Linux-UID: <uid>
> File-Linux-GID: <gid>
> ...
>
> Or even implementation-specific fields:
> File-libfoo-bar: <string>
> File-libfoo-baz: <number>
> ...
>
> Content-Encoding: <algoname>
>  Algorithm used for encoding the content.
> Content-Length: <size>
>  Size of the header's content in the archive.
> Content-Type: <typename>
>  Mime type of content (optional and likely irrelevant).
>
>
>
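
A sketch of how the comma-separated `File-Name` and `File-Size` lists of a FileGroup could be paired up and used to cut the decoded payload back into files. Assumptions not stated in the post: whitespace around list items is stripped, escaped commas inside names are not handled, and the helper names are hypothetical.

```python
def parse_number(text):
    # Decimal or C-style 0x hex.
    text = text.strip()
    return int(text, 16) if text.lower().startswith("0x") else int(text, 10)

def split_list(value):
    # Commas are the general separator; escapes are ignored in this sketch.
    return [item.strip() for item in value.split(",")]

def group_members(header):
    """Return [(name, size), ...] for a File or FileGroup header."""
    names = split_list(header["File-Name"])
    sizes = [parse_number(s) for s in split_list(header["File-Size"])]
    if len(names) != len(sizes):
        raise ValueError("File-Name and File-Size list lengths differ")
    return list(zip(names, sizes))

def slice_group(header, decoded):
    """Cut the decoded payload (files packed end to end) into per-file bytes."""
    files, pos = {}, 0
    for name, size in group_members(header):
        files[name] = decoded[pos:pos + size]
        pos += size
    return files

header = {"Header-Type": "FileGroup",
          "File-Name": "a.txt, b.txt",
          "File-Size": "3, 0x4"}
files = slice_group(header, b"abcdefg")
```
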



Claudio Grondi
03-31-05 01:55 AM


Re: "HTAR" archive format idea
"Claudio Grondi" <claudio.grondi@freenet.de> wrote in message
news:3b0ggpF69bvisU1@individual.net...
> How does it compare to the format used by 7zip?
>
well, first off, it is completely different...

7zip uses a binary tv style format apparently consisting of byte prefixes
followed by prefix specific data, and encodes larger numbers with a vli
scheme.

imo, it is almost an exact opposite:
mine is mostly text, 7z is binary;
mine has an open tag/value structure, 7z has a fairly fixed structure;
mine should be easy to extend independently, 7z will likely require
centralized activity;
mine has minimal concern for format overhead, 7z has massive concern for
overhead (eg: 7z often uses individual bytes and bitpacking, whereas mine
represents numbers in plaintext...);
...


my main reasoning is primarily that the headers are likely to be smaller
than the files anyways, so a little bloat is probably no big deal.

the file is likely to be read/written in binary mode anyways, so things like
seeking are expected (including, eg, possibly space padding numbers so it is
possible to seek back to them and fill them in later or such).
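
A sketch of the space-padding idea: the writer reserves a fixed-width slot for `Content-Length`, streams the content, then goes back and fills in the number. The field width, the right-padding, and the name `write_entry` are assumptions; note that under the one-space-after-colon rule the padding becomes part of the value, so a reader would have to strip it.

```python
import io

WIDTH = 10  # hypothetical width of the reserved number slot

def write_entry(out, name, chunks):
    """Write a File header with a padded Content-Length, then patch it."""
    out.write(f"Header-Type: File\r\nFile-Name: {name}\r\nContent-Length: ".encode())
    slot = out.tell()                       # remember where the number goes
    out.write(b" " * WIDTH + b"\r\n\r\n")   # placeholder, then end of header
    total = 0
    for chunk in chunks:                    # content size is not known up front
        out.write(chunk)
        total += len(chunk)
    end = out.tell()
    digits = str(total).encode()
    if len(digits) > WIDTH:
        raise ValueError("content too large for the reserved slot")
    out.seek(slot)
    out.write(digits.ljust(WIDTH))          # overwrite the placeholder in place
    out.seek(end)                           # continue appending after the content
    return total

buf = io.BytesIO()
written = write_entry(buf, "a.txt", [b"hello ", b"world"])
```
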

otherwise, I might not want writer seeking, so chunking may make sense
instead. I guess it would depend on the writer.

may as well keep the footer (defined as another header).
dunno, could put cumulative compressed crc's there or something.

Example of which would be, eg, allowing:

Content-Type: chunked

12
Hello All,
13
The Next Part
Content-Adler32: *


or whatever...
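
One possible reading of the chunked example, as a sketch. The post does not pin down the details, so these are assumptions: lengths are decimal and count only the payload bytes (the "12" above suggests the line ending may have been counted instead), each payload is followed by a newline, the chunk list ends at the first line that is not a number, and the footer carries the Adler-32 as hex. The function names are hypothetical.

```python
import io
import zlib

def write_chunked(out, chunks):
    """Write length-prefixed chunks, then a footer header with the checksum."""
    checksum = 1                                   # Adler-32 starting value
    for chunk in chunks:
        out.write(str(len(chunk)).encode() + b"\n" + chunk + b"\n")
        checksum = zlib.adler32(chunk, checksum)   # running checksum
    out.write(f"Content-Adler32: 0x{checksum:08x}\n\n".encode())

def read_chunked(stream):
    """Read chunks until a non-numeric line; leave the footer unread."""
    data = bytearray()
    while True:
        pos = stream.tell()
        line = stream.readline().strip()
        if not line.isdigit():
            stream.seek(pos)          # footer (or EOF): hand back to the header parser
            return bytes(data)
        data += stream.read(int(line))
        stream.readline()             # consume the line end after the payload

buf = io.BytesIO()
write_chunked(buf, [b"Hello All,", b"The Next Part"])
buf.seek(0)
body = read_chunked(buf)
footer = buf.read().decode()
```

Because every payload is length-delimited, the writer never needs to seek back, which is the trade-off the post describes.
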




cr88192
03-31-05 08:56 AM


Re: "HTAR" archive format idea
"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
news:gUH2e.218$HA6.71@fe07.usenetserver.com:

> my main reasoning is primarily that the headers are likely to be
> smaller than the files anyways, so a little bloat is probably no big
> deal.

Actually I disagree with your reasoning, as you're adding additional bloat
to the code to manage the archive; it is much easier to work with a binary
file with a fixed structure or a dynamic one, than with a text one, since
you need to parse the headers back into a format usable from your program.

If you're looking for a chunkable format, take a look at the old Amiga IFF
format or the PNG one as examples.

See ya

Vicente Werner
03-31-05 08:55 PM


Re: "HTAR" archive format idea
"Vicente Werner" <Nothin@nothing.com> wrote in message
 news:Xns962ACE7A12A11notasinglethingofmy
i@216.196.109.144...
> "cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
> news:gUH2e.218$HA6.71@fe07.usenetserver.com:
> 
>
> Actually I disagree with your reasoning, as you're adding additional bloat
> to the code to manage the archive; it is much easier to work with a binary
> file with a fixed structure or a dynamic one, than with a text one, since
> you need to parse the headers back into a format usable from your
> program.
>
yes, I know, it is more awkward, but it should be far more extensible
without risking breaking existing tools...

I am now considering adding simplistic compression to reduce the header
bloat, at the cost of a lot more code bloat.

anyways, I was never saying text was a "convenient" way of doing the
archives, only that the structure should be tolerable, and is in most ways
almost exactly the opposite of the 7z format...


a more persuasive argument would have been that the headers were
unreasonably large, which I might have dealt better with, this is partly why
I am considering the compression (at the cost of losing the format being
mostly textual...).

this should not significantly hurt extensibility.


values <128: passed through clean.
values >=128: interpreted as lz values (run = value - 128)
128 (run 0): interpreted as an escape, 1 byte escaped value
129 (run 1): reserved, possibly length-prefixed escape
130 (run 2): reserved
131..255: sane run values
next 2 bytes are offset

(note: format most likely to do poorly on binary data given mass
escaping...).
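
A sketch of a plain linear decoder for the byte scheme above. The post leaves several points open, so these are assumptions: the run length is the code byte minus 128, the 2-byte offset is little-endian, and it counts backwards from the end of the data decoded so far. The name `lz_decode` is hypothetical.

```python
def lz_decode(data):
    """Decode the byte-oriented LZ scheme into bytes (linear, no ring buffer)."""
    out = bytearray()
    i = 0
    while i < len(data):
        code = data[i]
        if code < 128:                      # literal, passed through clean
            out.append(code)
            i += 1
        elif code == 128:                   # escape: next byte is a raw value
            out.append(data[i + 1])
            i += 2
        elif code < 131:                    # 129 and 130 are reserved
            raise ValueError(f"reserved code {code} at offset {i}")
        else:                               # 131..255: back-reference
            run = code - 128                # assumption: run = code - 128
            offset = data[i + 1] | (data[i + 2] << 8)   # assumption: little-endian
            if not 0 < offset <= len(out):
                raise ValueError(f"bad offset {offset} at offset {i}")
            start = len(out) - offset
            for k in range(run):            # byte-wise so overlapping copies work
                out.append(out[start + k])
            i += 3
    return bytes(out)

repeated = lz_decode(b"abc" + bytes([131, 3, 0]))   # copy 3 bytes from 3 back
escaped = lz_decode(bytes([128, 200]))              # one escaped high byte
overlap = lz_decode(b"a" + bytes([133, 1, 0]))      # run 5 at distance 1
```

Every byte at or above 128 costs two bytes through the escape, which is why the scheme is expected to do poorly on binary data.
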

I have put a little bit of thought into how to try for a fast stream-centric
decoder. buffer would be circular with 2 pointers (current read position and
current end of decoded data).
wrapping is a difficult issue performance-wise.

encoder, should be able to get a reasonably fast one.
for a 64k dictionary, I am imagining needing approx 320kB of memory (256k of
which would be related to hashing). I put a little thought into figuring how
to do wrap-around with the hash data and efficiently handling the wraparound
issues.

slower than doing nothing, but might be acceptable.


note: I am now thinking this approach may be relevant to xml, at the cost of
being slower than a true binary xml. given I am considering an algo which
does not use entropy coding, encoding/decoding should be faster than with one
that does (and faster than my other decoders, since it will be possible to
detect runs with a single mask and the run structure is fixed).

a plain linear decoder would likely be faster than the ring-based one I am
imagining, but a ring-based one reduces memory-related concerns, and should
be better to handle the mixing of encoded and non-encoded data...

not like I need that much speed in accessing the headers anyways though...

> If you're looking for a chunkable format, take a look at the old Amiga IFF
> format or the PNG one as examples.
>
I looked at those, but decided against them on the grounds that they make
extensibility more complicated (yes, IFF and PNG have generally extensible
formats, but imo, there are likely a lot more issues than there would be,
eg, with something closer to http...).


or such...




cr88192
04-01-05 08:55 AM


Re: "HTAR" archive format idea
Hy;

You should take a look at a DICOM implementation
if you want to know how to handle proprietary
archive information that fits market needs.

The amount of archive information in DICOM is horribly
large. Maybe you can detect the heaviest flaws and
find better ways.

Ciao
Niels

Niels Fröhling
04-01-05 08:55 PM


Re: "HTAR" archive format idea
"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
news:Qm13e.15$A71.1@fe07.usenetserver.com:

> yes, I know, it is more awkward, but it should be far more extensible
> without risking breaking existing tools...

A proper binary format does the same; take a look at IFF, for
example, or PNG.

> anyways, I was never saying text was a "convenient" way of doing the
> archives, only that the structure should be tolerable, and is in most
> ways almost exactly the opposite of the 7z format...
I don't think it's going to be tolerable; in the end, the chunk of code
needed to deal with just the headers will be HUGE relative to the size of
it.

> a more persuasive argument would have been that the headers were
> unreasonably large, which I might have dealt better with, this is
> partly why I am considering the compression (at the cost of losing the
> format being mostly textual...).
The overhead I'm worried about is not that one; it's the one in the code.

> I looked at those, but decided against them on the grounds that they
> make extensibility more complicated (yes, IFF and PNG have generally
> extensible formats, but imo, there are likely a lot more issues than
> there would be, eg, with something closer to http...).

Of course there is always a limit on how much you can expand a format or
how much you can do with it, no matter how you design it; in the end
there's always a point where something needs to be changed, breaking
compatibility to do it.

For example, I do not think your system can realistically deal with
delete operations on very large archives with 100000+ compressed
items, and adding intra-item error recovery records will likely impose
heavy overheads.


Vicente Werner
04-01-05 08:55 PM


Re: "HTAR" archive format idea
"Vicente Werner" <Nothin@nothing.com> wrote in message
 news:Xns962BBDB2AC15Enotasinglethingofmy
i@216.196.109.144...
> "cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
> news:Qm13e.15$A71.1@fe07.usenetserver.com:
> 
>
> A proper binary format does the same; take a look at IFF, for
> example, or PNG.
>
I know both formats.

they are extensible, but one has to worry a little more about behavior by
tools upon encountering unknown chunks (png specifies this a little more
than iff does), one also has to worry more about fourcc clash, whereas with
plaintext one can generate much longer names.
 
> I don't think it's going to be tolerable; in the end, the chunk of code
> needed to deal with just the headers will be HUGE relative to the size of
> it.
>
yes, I know...
 
> The overhead I'm worried about is not that one; it's the one in the code.
>
ok.

I wrote a basic parser/dumper already, and it would not be too hard to
modify it into a decompressor.

at present, vars are parsed and stuffed into locals.
mostly I am thinking of having a struct which would hold all the known
parsed vars, and dispatching using the struct (header-type and whatever)
to perform the decode.
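
A sketch of that struct-plus-dispatch layout, in Python rather than the author's own code. The `Header` dataclass, the `KNOWN` table, and the handler functions are all hypothetical names; unknown fields are kept in a side dictionary, and unknown header types are ignored as the original proposal suggests.

```python
from dataclasses import dataclass, field

@dataclass
class Header:
    """Holds the known parsed fields; everything else lands in `extra`."""
    header_type: str = ""
    file_name: str = ""
    content_length: int = 0
    content_encoding: str = ""
    extra: dict = field(default_factory=dict)

# Map wire names to struct attributes.
KNOWN = {"Header-Type": "header_type", "File-Name": "file_name",
         "Content-Length": "content_length", "Content-Encoding": "content_encoding"}

def to_struct(fields):
    header = Header()
    for key, value in fields.items():
        attr = KNOWN.get(key)
        if attr is None:
            header.extra[key] = value          # unknown field: keep, do not fail
        elif attr == "content_length":
            header.content_length = int(value)
        else:
            setattr(header, attr, value)
    return header

def handle_file(header):
    return f"extract {header.file_name} ({header.content_length} bytes)"

def handle_directory(header):
    return f"mkdir {header.file_name}"

HANDLERS = {"File": handle_file, "Directory": handle_directory}

def dispatch(header):
    handler = HANDLERS.get(header.header_type)
    return handler(header) if handler else None   # unknown header types are ignored

action = dispatch(to_struct({"Header-Type": "File", "File-Name": "a.txt",
                             "Content-Length": "5", "File-libfoo-bar": "x"}))
```
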

 
>
> Of course there is always a limit on how much you can expand a format or
> how much you can do with it, no matter how you design it; in the end
> there's always a point where something needs to be changed, breaking
> compatibility to do it.
>
yes, I know as well, just afaik, with an IFF or PNG style format, this
threshold is likely to be a little lower.
of course, one could use plaintext for the file-info, but then again, same
problem.

I started designing a format like this already, and had realized that
compound entries would be a significant design issue with such a format, but
not so big of a deal with text.

> For example, I do not think your system can realistically deal with
> delete operations on very large archives with 100000+ compressed
> items, and adding intra-item error recovery records will likely impose
> heavy overheads.
>
nope, it probably won't...
I aim low up front, but I still hope for a flexible format (eg: one that can
possibly be easily customized for "experimental" uses or whatever), or
allowing patching in 3rd party tools (eg: want bzip2 support, just add an
entry in the config file, and hope the person decompressing did similar).

as a result, likely I am going to be calling external tools for compression
and decompression.

most generic archive use, though, consists of just archiving a directory
or unpacking files into a directory.

it is likely to beat out tar though, as it will be possible to read file
lists without decompressing the whole file.
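
A sketch of why listing is cheap in this layout, assuming the headers themselves are stored as plain text: the reader parses each header and seeks past the content using `Content-Length`, never touching the encoded data. The name `list_names` is hypothetical, and tab-continued values are left out for brevity.

```python
import io

def list_names(stream):
    """Return the File-Name of every header, skipping all content."""
    names = []
    fields = {}
    while True:
        raw = stream.readline()
        line = raw.rstrip(b"\r\n")
        if line:                                   # inside a header
            key, _, value = line.partition(b": ")
            fields[key] = value
            continue
        if fields:                                 # blank line or EOF ends the header
            if b"File-Name" in fields:
                names.append(fields[b"File-Name"].decode("utf-8"))
            skip = int(fields.get(b"Content-Length", b"0"))
            stream.seek(skip, io.SEEK_CUR)         # jump over the encoded content
            fields = {}
        if not raw:                                # end of archive
            return names

archive = (b"Header-Type: File\nFile-Name: a.bin\n"
           b"Content-Encoding: deflate\nContent-Length: 4\n\n"
           b"\x00\x01\x02\x03\n"
           b"Header-Type: File\nFile-Name: b.bin\nContent-Length: 3\n\n"
           b"xyz")
names = list_names(io.BytesIO(archive))
```
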

it will also be a lot more extensible than either tar or zip.

or such...




cr88192
04-02-05 01:55 AM


Re: "HTAR" archive format idea
Hy;

> err, somehow I get the idea this is not a file archiver...

The general solution you are trying to reach with your open text
attributes is not (only) specific to an archiver.
In your concept you tag files, and DICOM graphic files
may give you a good idea about that.

Ciao
Niels


Niels Fröhling
04-06-05 05:40 PM


Re: "HTAR" archive format idea
"Niels Fröhling" <niels.froehling@seies.de> wrote in message
news:d2mnbo$alt$1@domitilla.aioe.org...
> Hy;
> 
>
> The general solution you are trying to reach with your open text
> attributes is not (only) specific to an archiver.
> In your concept you tag files, and DICOM graphic files
> may give you a good idea about that.
>
oh, ok then.




cr88192
04-06-05 05:40 PM


Re: "HTAR" archive format idea
"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in
news:Sgl3e.3062$A71.493@fe07.usenetserver.com:
> they are extensible, but one has to worry a little more about behavior
> by tools upon encountering unknown chunks (png specifies this a little
> more than iff does), one also has to worry more about fourcc clash,
> whereas with plaintext one can generate much longer names.
The first argument is not a fault of the format, it's a failure of the
applications dealing with them; as for the second, if you use 4 bytes,
you have 2^32 possibilities... hard to believe you'll get into clashes.

> yes, I know as well, just afaik, with an IFF or PNG style format, this
> threshold is likely to be a little lower.
You still haven't shown a point backing that argument.

> I started designing a format like this already, and had realized that
> compound entries would be a significant design issue with such a
> format, but not so big of a deal with text.
Why?

> I aim low up front, but I still hope for a flexible format (eg: one
> that can possibly be easily customized for "experimental" uses or
> whatever), or allowing patching in 3rd party tools (eg: want bzip2
> support, just add an entry in the config file, and hope the person
> decompressing did similar).
Adding human intervention will only make your system less usable.

> it is likely to beat out tar though, as it will be possible to read
> file lists without decompressing the whole file.
Please don't compare apples with oranges, they're different! Tar was
designed a long time ago as just an archival format; they didn't even think
of file-by-file compression, so it's not a valid reference or benchmark.
A valid reference would be any of the current file formats without
compression.

Vicente Werner
04-06-05 05:40 PM


All times are GMT.

 