For Programmers: Free Programming Magazines  


Home > Archive > Compression > April 2006 > Re: Please, PLEASE, hold your questions/comments/elsewhat til the end. Thank you. :)









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Re: Please, PLEASE, hold your questions/comments/elsewhat til the end. Thank you. :)
Einstein

2006-04-27, 6:56 pm

Ok, here I go... the other threads are slower in means, but people here
post like I do! We write books where paragraphs could do :D

First step: Probability alteration

Via altering the probabilities in a binary string we put ourselves one
step closer to lossless repeatable binary compression

111 = 11
110 = 101
101 = 011
011 = 001
001 = 010
010 = 100
100 = 0001
000 = 0000

The only real parts critical to this step is to make sure we get a 40%
ratio of 1's and a 60% ratio of 0's. This statistically speaking
achieves this. Of course not all binary is statistically going to occur
this way (A RAR file in my experience has a LOT more 000 strings at a
increase of 110, 101, 011 strings and a reduction of 100, 010, 001
strings).

~Now there is ways to work in a command section that alters this
substitution table. 6 bits in the command section would allow a
alteration of the entire group to the most favorable of outcomes.~


Step 2: Identify the probabilities:
This is not in software code, it's just simple math. After our
conversion 0000-0000-0000-0000 has the probability of .6^16 of
occurring. 1111-1111-1111-1111 has a .4^16 chance of occurrence.
1111-0000-1111-0000 has a (.6^8)*(.4*8) chance of occuring. Ergo we can
now work out what is most likely to occur.


This next part is educational, since we must derive the most
compressible data before we can actually swap 16 bits for 16/15 bits
and place identifiers as per the real steps. Therefore this is a
~informational~ of a later step.

I made a means for a 4 bit string to equal two seperate parts. A front
end, and a back end. These ends cannot equal more than 5 bits, and most
commonly has more 0's than 1's in total. The front end is read
sequentially from left to right. The back end is applied in reverse
from right to left. Eventually they meet. The front end indicates what
possibilities can occur in the back end.

Here is the total for the 16 possibilities:


Front end
00000
00001
0001
001
01
1
0001
001
01
1
001
01
1
01
1
1

Back end
*
*
0
00
000
0000
1
10
100
1000
1
10
100
1
10
1


*NOTE a result of 00000 or 00001 on the front end indicates no back
end!

Example... we have the result 01 in the front end. Our back end
therefore has only 4 possible results: 000, 100, 10, 1. We read the
back in reverse to find it. This gives us our initial values.

Now what I did from here was to make every 16 bit outcome apply to
this. I found thier resulting size. I then applied this method to them
*again out of order, informational only*

11 = 111
10 = 110
01 = 10
00 = 0

Of course there were a LOT of strings that ended with a 1 or a 0 extra
(odd numbers). In which case for simplicities sake I assumed the worst
case of each following after: 1 = 111 (split in half for 1.5 bits) and
0 = 10 (10 split in half for 1 bit)

Of course people will not want to talk about the front end... I did the
math there to, if you want, do the math on probabilities of the front
end, and the back end both, and get a finite answer... I dont want to
make another 10 pages explaining this portion in full... the numbers
are fairly representative)

So now we have a bit length value for every 16 bit outcome. Some are
worth 10 bits therefore, and some are worth as high as 24 bits in
length. Obviously some are 'not' desirable.

We simple figure which is the smallest sized, and make it swap with the
most probable to occur. This is actually easy to do with
0000-0000-0000-0000 = 0000-0000-0000-0000. We do this for the first
32768 sequences only! (2^15). Each time one of these occurs, in a
seperate temp line (Lets call it $temp) we place a 0. Since they are
statistically more occuring, there will be a large number of 0's in
this sequence! Every time we get a result from the other 2^15 results,
we apply it in binary sequence a value equal to 15 bits in substitution
for the result, with a 1 in front. Therefore it remains 16 bits long,
but is now prefaced by a 1, and is in groups with 0's in the $temp line
identifying compressible materials.

We need to count the entire amount of those 1 results, this is a
portion of the command section. The total is made into a binary figure,
divisible by 4 bits. So ergo the result is 1024, we need to use 12
bits, and the 12 bit result that indicates 1024. We then split the
identifying bits up into 4 bit segments... so xxxx - xxxx - xxxx -xxxx
would be the view. Each time we 'continue' (IE need the remainder after
a split) we place a 0. When we no longer need to continue, a 1
indicates the end. Thus we know to an exact amount of the results that
do not compress.

So we have the command section added to, we have a $temp line, and we
have a 'to be compressed' section.

The compressed section is run through that 4 bit result... eventually
the data is completely added together, back and front ends.

Now we run the 2 bit conversion.

Statistically we have compression, regardless of 'major' fluctuations.
If there is an EXTREME fluctuation, then we just make the software
recognize it, and place in the command section the ability to reverse
all 1's and 0's, thus thwarting the fluctuation. Mind you this is on
random binary, we already ran other compression tools on this to
'reduce' the patterns to a choatic mess.


Even after the various size increase portions, we still end up with
statistical compression.

The key is the usage of the front/back system, of altering the
probabilities in the binary string, and of effective cost management in
our ability to track what is occuring.

Additionally even in our 15 bit string sequence, we can purposely favor
0 results in the more likely to occur. Other stages increase the 0's vs
1's statistically as well. The result is a cascade of 0's in all sorts
of regions per compression cycle.

The more 0's, the more compression in the results, statistically... it
becomes a ongoing chain. At worst when we hit a 'major' amount of
0's... we say have 50% are all 0's in the 3 bit stage... this means we
grow there by 133.34% in size... but then we have a lot more results
that fall into the 10 bit for 16 bit section... giving us... 0.625 bits
in return... or after multiplying them together, we get: 83.3% even if
our average is say 11.8 bits.. 98.33825% of the size remains and then
we use others and come out pretty much 'even'. Yet if we reach that
level of 0's... we can use a command section to indicate usage of a
specific group of compression results. Either the 4 bit stage, the 2
bit stage, or even the 3 bit stage. Any of these 3, or combination of
them can be applied via adding 4 bits to our command section. We can
even allow for a dictionary response or some other standard compression
scheme.


At no time do we end up with the same code as before, it is ALWAYS
changing! Therefore in effect we guarantee entropy has a hard time even
trying to apply. It's only when the file reaches a certain minimum size
that entropy could even come into play.


Without a MUCH stronger version of excel however, capable of handling
all my math at once, or a PC worth it's weight in Microsoft Stock
options, I am unable to accurately find a 'statistical' size level.

I do know this:

The larger the file, the better the results.

we swa

Willem

2006-04-27, 6:56 pm

Einstein wrote:
) Statistically we have compression, regardless of 'major' fluctuations.

How much compression ? Do you have hard figures from your math ?
Have you used the actual statistics, and not just some guesstimates ?
Here's a prediction: You can compress it to 97%.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Willem

2006-04-27, 6:56 pm

Einstein wrote:
) 95.31% is the math + command section provided even statistics with the
) current version, on the first cycle.

I'd like to see the calculations that produce this result.

When I do the math, I get 97%.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
David A. Scott

2006-04-27, 9:55 pm

"Einstein" <michaelhh@gmail.com> wrote in
news:1146151409.670683.41580@i39g2000cwa.googlegroups.com:

>
> At no time do we end up with the same code as before, it is ALWAYS
> changing! Therefore in effect we guarantee entropy has a hard time even
> trying to apply. It's only when the file reaches a certain minimum size
> that entropy could even come into play.
>
>


Please think for one second and then give an honest amswer.

Suppose you pick some length at which you think all files could be
compressed say 10 bits thats 2**10 strings you could map one to the
empty string then 2 could map to the 2 bit strings and 4 could map to
the 3 bits strings in fact you could have all but one string mapped to
samller values. Lnow lets look at what happens if you add the ability
to compress 11 bit strings. all the string from 0 to 9 bits have been used
to compress most of the 10 bit srings. That means to attempt to compress
the 11 bit sting you could map one less than half of then to the 10 bit
strings the rest would have to stay as 11 bit strings. Try to guess
what happens when you go to 12 and 13 bit input strings its not a
pretty picture. Unless you pretend the only useful strings have exactly
the same number of ones ans zeroes. That will compress.


Wait I see it now 10 bits to to small then change 10 to the n
that you belive exists and 11 to the n+1 I hope you see the trend,


David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"

sillybanter@gmail.com

2006-04-28, 7:55 am

Einstein <michaelhh@gmail.com> wrote:

> Ok, here I go... the other threads are slower in means, but people here
> post like I do! We write books where paragraphs could do :D


Wow -- you write books where nothing would do... How many hours of
your life have you wasted on this when almost everyone here could have
told you in the first 30 seconds that what you were doing won't work?
Your time would have been more productively spent lying outside in the
grass and enjoying the fresh air...

So back to what you've posted. Your "step 1" was clearly stated but
utterly useless. All you've done is change the statistics -- big
deal. You could have started with the following statement: "I can
compress any binary string with 0/1 probabilities of 0.4/0.6 so that
on average I get a result smaller than 97% of the input." That's what
you have after the first step, after all, so let's just *start* there.
And, incidentally, I picked 97% because Shannon's basic theorem says
that this is impossible.

Incidentally, changing statistics is easy. Give me *any* 0/1
probability split and I'll give you an encoding that achieves
approximately those probabilities. This is simple and provably
doesn't do anything useful for compression.

So now you go through some horribly complex coding process and think
you can beat the 97% figure. Here's the wonderful thing about
Shannon's theorem: It doesn't matter for beans how complex that
coding stage is. You could have a completely complicated process that
requires 1000 pages to describe, with carefully crafted lookup tables
and clever coding tricks, and it just wouldn't matter. The *only*
thing you gain from the complexity is a higher chance of getting
yourself and making a mistake -- which is exactly what you've
done.

There are two possibilities for your steps 2-whatever:

1) You've got a valid coding scheme but have messed up the
math/analysis so that it really doesn't give any compression.

2) You've produced an invalid code (a code that isn't uniquely
decipherable, for instance).

I'll leave it for you (or others who have more time than I do) to
discover which is the case, but I'll give you a simple exercise that
you can do that will demonstrate your error: Take the smallest input
size where you think your technique will work -- if that's 8 bits you
can do this by hand, but for 16 bits you'll have to write a program.
Now do your coding on every possible input string of that length so
you basically have a lookup table. From what you claim, it sounds
like you think you can make a lookup table where if the input is 16
bits then the outputs will be (on average?) 15 bits or smaller. Now
from that output it will be easy to determine which of the two
possibilities above apply by testing with the Kraft inequality: if
your resulting code violates the Kraft inequality, then it's an
invalid code (case #2); if it satisfies the Kraft inequality, then
it could possibly be a valid encoding, but when you weight the strings
the probabilities of them occuring you'll get something larger than
97% of the input size -- guaranteed (this means you're in case #1).

So have fun. But don't forget to go outside and enjoy the fresh air
every now and then...

--

Steve Stringer
sillybanter@gmail.com


Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com