| nightlight 2006-01-22, 9:55 pm |
| > Appearently so far it was to hard to get QI to
> fit in the world of Moffat and use real files
> so anyone could check with the files of there
> choice.
QI source code & executable (which are available for anyone to
verify as they wish), gives you OutSizeQI in bits to use for
the compression ratio vs any AC you may have:
RC = OutSizeAC [bits] / OutSizeQI [bits] ... (1)
using memory buffers that you can fill with any data you wish.
When you can make a rational explanation as to how writing the
buffers to a file and reading them from it, can change RC in
(1), then I will consider your suggestion for "upgrading" the
QIC kit with file I/O. So far, you haven't given a _single
rational reason_ as to how RC in (1) can change by mere writing
& reading the memory buffers to/from the disk. We all know why
you don't want to say it, see the method (b) described here:
http://groups.google.com/group/comp...d62e372056d9d53
All you have been offering is a repetitive lame whine like the
one above: "so anyone could check with the files of their
choice..." which is a complete BS, given that with the QI source
_publicly available_ anyone can already check on any input
pattern given to the coder via a memory buffer (which includes,
obviously, any pattern you may read & write from/to a file).
When you give a rational reason that is consistent with all the
facts, such as the availability of QI source at:
http://www.1stworks.com/ref/qi.htm
and the intended audience & purpose for the source release:
http://groups.google.com/group/comp...d8dcafd8b947ea1
then I will listen. Otherwise you might as well insist that the
file used in the QI "upgrade" you are demanding for "real" test
must also be named as SCOTT1234.txt to count as a "real" test in
your book (which will affect RC in (1) by about as much as
any other file name, or any file at all).
A difference that adding file i/o would make, though, regarding
the QI source kit would be to lower the signal to noise ratio
on: (A1) the speed advantage over AC, as well on (A3) (the
optimality e.g. by insisting on counting output sizes in bytes
only) and (A4) (stability & availability of the precise output
size). Why would I care, given the absence of any rational basis
to believe that RC in (1) will change from writing memory
buffers to a file, to waste time and to expand the QI source
size just to add gratuitous noise on the top of the (A1),(A3) &
(A4) signals?
> Look you picked the 3 alphabet size so that AC would
> look bad...
It was you who proposed A=3 test (your ABC), not me. I only
responded when Matt Mahoney reported his first results on your
ABC test "challenge":
http://groups.google.com/group/comp...b1fed7f8181bd31
that the QI.exe included in the source kit already has a command
line which can give you the answer for symbols 0,1,2 instead of
ABC for N=10^6 (or with recompiling for N=10^7). My response on
Matt's post was here:
http://groups.google.com/group/comp...f1ee67d18b63f5a
which you questioned (without apparently reading it at all since
all you asked was already answered in the post). Then all of it
re-re-re-explained to you again here:
http://groups.google.com/group/comp...08e96ebcb4577f1
and more here (on fractional bit sizes):
http://groups.google.com/group/comp...e015f38d228969e
and here few more times:
http://groups.google.com/group/comp...d77316462d9ee42
http://groups.google.com/group/comp...a6773533448f7de
http://groups.google.com/group/comp...9ff2ec175f1a5ed
http://groups.google.com/group/comp...7be8670d64e9f25
.... etc, round and around. You ought to go read all that before
asking the same things for the 30-th time.
As to the tests results on A3, Matt showed his results, I got
Moffat98 results, all consistently showing significant AC output
excess over QI. Since you rejected to test (or at least didn't
report your results) on a proper random sample, as explained in
the posts above, you have no claim to be compared to anything.
The point made in (A2) with A32, which for whatever reason you
keep misstating exactly upside down in your post above, is that
some larger alphabet A32 >> 3 will only increase the O(N)
component of the AC excess vs QI, given as D in (A2). This
component will grow as A32/3. QI.exe will show you the excess
for any A up to A=2^32-1 (as already explained, use QI.exe cmd
line "cr<alphabet_size_A>" to see it).
> If this is so why could you not complete the contest?
There is no contest left regarding the ratio of output sizes RC.
As to your loop:
while(1) printf("Yeah, but what about file i/o?");
you first need to explain how the file i/o will change the ratio
RC in (1) the answer for which we already know for memory buffer
tests, Mahoney & Moffat98 providing AC sizes in (1) and QI.exe
providing QI size in (1) (which anyone can verify with QI source
kit as explained 30 times before).
> First of all not sure you have a grasp of what you are proposing.
> If one had such a large number of symbol 2**32 its not likely
> that for any reasonable length file any symobol would appear
> more than once. Have you actually thought this through?
It appears, someone else here might need a bit of "grasping".
Take some alphabet size A such that 2^31 < A < 2^32, for our N=
10^6 and run your AC on it. For example, I just ran QI.exe using
the command line (you are welcome to verify this):
QI cr3133061822 n1e6 i10
which is a test for A=3,133,061,822 N=10^6 and i=10 iterations
with randomly generated inputs (the iter doesn't matter for QI
since, as pointed out in (A4), QI produces always exactly the
same size, including the max index value which in this case will
always have the top 32 bits smaller than 0x8865BAF8) and the QI
output size for the index is:
Q = 31544926.09167... bits
which is about 2.422e-4 bits above the lower bound N*log(A) for
this index. That amounts to QI excess of 2.422e-10 bits per
symbol. You can run your AC on this A & N and tell me what you
get. Measure the total size and the maximum variation in size
(with AC you get only whole bits since its exact upper bound on
the index is obliterated due to normalization of index to 1.0,
which in turn is done to comply with AC's coarse-grained
probabilistic parametrization for finite sequences, enumeration
them on the cumulative probability scale, as explained in (A2)).
Recall also that just storing 32 bits/symbol for N=10^6 will use
455073 bits _more_ than the QI's output 31544927 in whole bits.
> It's interesting to note that the file size I compressed
> to for the contest didn't seem to fuctuate at all for a
> wide range of input messages which were all the came
> length and perfectly equiprobale. ...
> I love this its really funny. QI "down to the exact bit
> fraction" Its really funny when you think about it.
The question is not "file size" (which is rounded to next byte
at best, or even sector or cluster on the disk) but size in bits
(or bit fractions). As explained in (A4) on several examples,
there are many practically important cases when size in bits,
and even size in bit fractions matters (whenever you have many
such little leftovers). That you can imagine or cite cases where
such precision doesn't matter, such as when storing output into
a free size file on the disk, is a non sequitur for the point
made in (A4). The point in (A4) is that there are practically
important cases where such precision does matter and there QI is
distinctly ahead of AC (let alone Huffman).
That the bit fractions do matter (and what they mean), it should
be already obvious even from our tests on A=3, where we were
trying to pack symbol x<3 as closely to the exact bit fraction
log(3)=1.584962... as possible, instead of just storing each
symbol in 2 whole bits, or coding it in Huffman code as e.g. 0:0
1:10 2:11 (which gets you 1.66.. bits/sym on average).
The point (A4) is in this case (A=3) that if you need to store
compressed inputs with N=10^6 symbols into fixed size fields (or
skip quickly over compressed data in a larger package containing
them), with the flat 2-bit code for symbol you will need to
reserve space for 2*10^6 bits. Huffman, which on average reaches
1.66.,. bits/symbol will also need to reserve 2*10^6 bits to
guarantee that _all_ sequences from the set of possible inputs
(I called it Set_QI) will fit into the reserved space. With AC,
depending on implementation you may need about 20-60 bits, as in
Mahoney's reported tests, to guarantee fit for _all_ possible
A=3 N=10^6 sequences { much more for Moffat98 which has a
sub-optimally skewed quantization as explained here:
http://groups.google.com/group/comp...fa6336f483bbb89
in a post to Thomas Richter).
The alternative to reserving the extra space with AC and
Huffman, in cases where you only need to traverse, within a
larger package of items, the compressed data _quickly_ (which
means without decompressing) but not store it into the fixed
size record/packet fields is to store the compressed length
separately, in addition to storing N and A (for self-contained
compressed package).
In contrast, with QI coding these A=3, N=10^6 inputs, you not
only can know that its index will certainly fit in precisely
1584963 whole bits, but you _also_ know from the quantized
power q[A^N] (which is the QI table entry) that the leading 32
bits of the index will be always smaller than the 32 bit value
0xB521509A (which means you can package the index via mixed
radix codes with other items to reclaim the N*log(3) bit
fraction to within 1/2^31 bits from the exact N*log(3) lower
bound). QI also does not need to store separately the compressed
length -- all it needs are A and N values, which it already
needs anyway (just like the other coders do need A and N in some
form) to know the exact index size and its exact upper bound.
This precision is consequence of QI output optimality (it
produces index closest to N*log(3) for any g-bit addends which
also satisfy Kraft inequality eq. (20) in [T3]) and its finer-
granularity parametrization (A2). That was all explained at
length in the earlier posts cited above. The (A4) only
summarizes the main points.
|