Home > Archive > Compression > March 2006 > Re: Please, PLEASE, hold your questions/comments/elsewhat til the end. Thank you. :)
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Re: Please, PLEASE, hold your questions/comments/elsewhat til the end. Thank you. :)
|
|
| Threshold 2006-03-30, 9:55 pm |
| Guess what? This one's long, too. I've done a bit of quoting, and a
bit of explaining why it doesn't apply.
I apologize for coming off a bit, well, rude I guess, and it's
certainly a defensive post, but I don't believe what I've had to say
has been properly considered. I suppose it might help if I could make
it more concise, such that people could read it inside a minute or so,
but I'm not sure I know how to do that with this. Unfortunately, being
vague takes up more space than being specific-at least in this case.
And if you expect me to be specific when I believe I could be
correct-you should consider what you'd do.
I've read the FAQ, several times in fact, over the past two years or so
(and as I recall, it hasn't changed much, if at all, which has also
been specified elsewhere in this group), and I still don't understand
why you think I'm so incredibly wrong. Or, if it is the case, why I
-AM- so incredibly wrong. I'm not entirely sure what you mean by "not
even wrong", but it seems as though you meant no harm, and no harm was
done. I'd -like- to think that "not even wrong" is better than
"wrong", but again, not sure exactly what you mean by that.
Just in CASE I managed to miss some refutation of what I've presented,
I did a re-read of it.
"The flaw in these arguments is that the the "house-keeping" type
information
is *not* negligible. If it is properly taken it into account, it
cancels any
gains made elsewhere when attempting to compress random data."
I have a problem with this. The method that I've come up with doesn't
fall into that category-and as far as lack of knowledge, isn't that the
basis of asking questions? I didn't come in here thinking I was right,
I came in here with a theory and asked to have it refuted.
The general theory on why one cannot compress random files, other than
the counting principle/argument/theorem, is that random files have no
pattern. Now maybe random files don't have a pattern-but this doesn't
require a pattern; it just requires repetition of byte pairs. I know
you've got to be familiar with BPE-and I'm aware that it was used and
intended for text, or for files with a small amount of unique bytes.
But I'm of the opinion that I've fixed that, and every single file over
65,536 bytes BY DEFINITION -has- to have at least one byte pair that
repeats. Now, at that level, the overhead would probably be more than
that which was saved-but I didn't design this for small files. Look at
a file say 1,048,576 bytes in length and tell me how many byte pairs
you're probably going to have repeat. Order doesn't particularly
matter; I believe I can get around that.
"It is commonly stated that perfectly entropic data streams cannot be
compressed. This misbelief is in part based on the sobering fact that
for a
large set of entropic data, calculating the number of possible bit
pattern
combinations is unfathomable. For example, if 100 ones and 100 zeros
are
randomly distributed in a block 200 bits long, there are
200C100 = 9.055 10^58
combinations possible. The numbers are clearly unmanageable and hence
the
inception that perfectly entropic data streams cannot be compressed.
The
key to the present compression method under discussion is that it
makes no
attempt to deal with such large amounts of data and simply operates
on
smaller portions."
Riiight. As you may have noticed, I didn't use that argument, nor
anything like it. I know the numbers aren't unmanageable, and it's
easier to refer to them as 2^N, etc, anyway. FACT: Counting principle
does not rule out compressing a file X to X-Y length with a
decompression scheme of Y length, for a total archive size of X. FACT:
2^N + 2^N-1 + 2^N-2, etc, is ALWAYS GREATER THAN 2^N.
Upon my mentioning of my idea to a friend of mine, he very carefully
reminded me I have only 256 possible bytes to work with. At this I
smiled at him and told him, "But I only use FOUR!" Of course, he
didn't believe me, and/or thought I was insane. And you may certainly
be curious about that-but I actually do use all 256, I just use four
for something different than the other 252. On another note, even
assuming that SOMEHOW (assuming we're going back to the classic
counting principle) one or more files are actually increased in size
during the first layer, based on the fact that my algorithm not only
subtracts but adds as well (it just subtracts more than it adds), I can
virtually guarantee you it would make it smaller than the original size
on the second layer. I still don't see the big deal about using
secondary extensions to tell the program how many times a file has been
compressed and if it was compresstransformed first or not. Maybe it's
information, and maybe a careless person could delete that information
and then have an unusable file-but it isn't information that was in the
file in the first place. And it isn't data hiding, either. As for the
requirement of storing the file name, which would change the file
contents, that isn't necessary, either. A simple modification of the
secondary extension takes care of that problem. And all you need then
is a program that'll read it.
The fact of the matter is that I've spent a lot of time looking at the
"recursive" algorithms/programs that people have supposedly created
previously, and I've looked at a great number of compression patents,
both useful and, in the case of the aforementioned, useless. I'm
attempting to learn from the mistakes of others. Which, as one should
be able to tell by reading what I've written-what I'm proposing is not
like any recursive function anyone's proposed before. It is highly
similar to BPE, but in a 256 "base" (if you'll allow me to use that
word in that context) rather than a <128.
"The basic idea behind a substitutional compressor is to replace an
occurrence of a particular phrase or group of bytes in a piece of data
with a
reference to a previous occurrence of that phrase." "LZ78-based
schemes work by entering phrases into a *dictionary* and then,
when a repeat occurrence of that particular phrase is found, outputting
the
dictionary index instead of the phrase."
I'd refer to that as a "pointer", I'm not sure what you'd call it.
That'd be my reasoning on why typical compressed archives don't
compress-it isn't that they've removed all patterns, they've just added
pointers that the program reads a certain way and they don't know how
else to represent them (while taking up less space). So it's a
"dictionary index". Same difference, effectively.
I can't seem to find anything else within the FAQ that actually applies
in this case. I'm of the impression that one or more of you for some
reason doesn't get what I'm saying, but I've been told this is a little
complex, so I'll try again. I'm also aware, as I've stated, that
typically containing information within a filename is considered
cheating-but my argument is that I'm only containing information in
there that wasn't in the file in the first place. Assuming I'm not
allowed to put it there, I'd like to know where else I could put it.
There's certainly a possibility I could get a program to treat certain
bytes in certain positions as the markers for compressiontransformation
and compression layers, but that's a bit more complex than I care to
think about right now as I don't yet have a working prototype (Read: I
-understand- C/C++, but don't know enough to write a program of this
magnitude and/or complexity). And I've already stated I don't have a
program, either, and that I don't expect to be able to compress any
file-because of the lack of a program.
So I'm going to try this again. You take your 2^N files and you
compress all of them with a conventional (impossible) recursive
algorithm. You have one or more collisions, based on the counting
principle. This is because these algorithms reduce, ONLY. Mine both
adds and subtracts, but subtracts way more than it adds-for the moment
you're going to have to trust me on that. Or just, well, hold off your
doubting until the explanation is over. Then you can doubt as much as
you want. So since mine both adds and subtracts, every file will have
a certain byte amount and content added to it, and every file will have
a certain byte amount and content subtracted from it. Again, counting
principle does not state that I cannot compress a file to X-Y length
with my scheme to decompress it being Y, for a total archive length of
X. It may be slightly silly to do that, but that renders it
compressible, as, FACT: 2^N + 2^N-1 + @^N-2, etc, is ALWAYS GREATER
THAN 2^N.
Now we all know BPE works-what we don't ALL know is how to apply it to
a file with all possible 256 bytes; or even a file with complexity such
that it has all possible 65,536 byte pairs. I'm fairly confident that
I know how, and I'm currently in the process of learning how to
translate that into actual code. Several friends of mine have told me
I should expect to invest several months to two years into a language
before being proficient-I don't really know if it'd take that long, but
I've seen (FINALLY managed to find it) a complete, or almost complete,
list of C/C++ functions, and I admit it's no cakewalk.
I've already given you all more than enough to go on to recreate the
algorithm, if you were so inclined, even though that wasn't my intent.
We all know that, I'll insert an IF here because everyone
thinks/believes/"knows" it isn't possible, IF every file is
compressible, there's a hell of a lot of money in that. Find me
someone you can guarantee won't run off with this; if need be for legal
purposes I'll provide anyone who's interested with an NDA to sign, and
then I'll tell you how it works. But you should have the basic idea by
now, hopefully, and hopefully it can be seen that this isn't like all
the other things you've seen over the years. If I have to, I WILL read
through all nearly 30,000 comp.compression topics just to prove to you
no one's ever tried this before-and I guarantee you that will be the
case-but I'd really rather not as that could take a good deal of time.
I realize that I've repeated myself here and there, but I'm trying to
clarify because it seems like you're not hearing what I'm saying.
Maybe you are, and I'm the one who's missing something you're saying.
I don't know. But I feel that I've presented my case better than those
who have come before me, that it is different, unique even, and that an
explanational refutation has not yet been put forth.
I don't believe I'm right. I just believe in the theory. If it can be
PROVED one way or the other, and please don't refer me back to the
counting principle, as that loophole should be wide enough for this,
then I'll be happy. Hear that? Prove that I'm right, and I'll be
happy. Prove that I'm wrong, and I'll be happy. I'm just a s er of
knowledge.
I would like to thank you guys for your input, though. Even though you
believe I'm wrong and I believe I could be right...the world would
never make any progress if everyone believed the same thing. It's been
stated that compression theory, information theory, whatever you'd like
to call it, is fairly well established and that the laws it sets down
cannot be violated, and I'd really rather not go into that at the
moment, except to specify, once again, that I'm not really violating
it, it's just a loophole. Or a pigeonhole, if you will.
And I DO understand the counting principle. I understand it
completely. I understand why every idea before this one in recursive
compression failed, and I understand why this one works. As far as
SHA1, MD5, etc are concerned, filename, minus extension for (at a
minimum) Windows systems, does not contribute to actual file content.
So while it does take up space on the drive, it's not exactly a valid
point for stating that this doesn't work. I'm aware that filename
length contributes to length in the few contests that are out there on
the web-but the actual content of a file can be compressed such that it
does not gain a single bit, if some stay the same. IF it is decided
that what I'm actually doing is increasing files here and there by a
byte, two bytes, three bytes, whatever, to specify that it was
compresstransformed and/or compressed, well, if you file increases in
size by 0.1% before decreasing in size by 10% (note this is an example,
and not what I actually expect), then don't you win in the long run?
And it's still storing 2^N files in 2^N + 2^N-1 etc combinations, which
STILL doesn't violate the counting principle.
Thanks.
|
|
|
|
|