Home > Archive > Compression > April 2006 > question about entropy and compression method
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
question about entropy and compression method
|
|
|
| hi !
let's imagine I want to compress a text file made of bits
25% of the bits are 1, and 75% are 0.
anybody knows how to compute the entropy of this file and to use an
arithmetic compressor.
but what is the new entropy of this file if I also have a condition : there
can't be more than six consecutive "1" ?
how can I compress this kind of file ?
thank you for any method or idea
cp
| |
| Willem 2006-03-28, 6:55 pm |
| cp wrote:
) let's imagine I want to compress a text file made of bits
A _text_ file ?
) 25% of the bits are 1, and 75% are 0.
)
) anybody knows how to compute the entropy of this file and to use an
) arithmetic compressor.
)
) but what is the new entropy of this file if I also have a condition : there
) can't be more than six consecutive "1" ?
)
) how can I compress this kind of file ?
Use a static hardcoded order-7 model.
In this specific case, you only need two probability sets.
Do you have some more info on what you're trying to achieve ?
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
|
| Hello !
Thank you for having answered.
The file I have to compress looks like this (extract) :
.... 0000000010000010001101000000000000101000
1010111010001000010001000001010010001100
0000101001000001100000100000001000000010
10001000010010000100100...
There are many sequences of "1" , much less of "11" , much less of "111"
and so on...
And there are not sequences of more than 6 consecutive 1 (eg : "111111" )
I can read the whole file and know exactly the probability of each sequence.
I can also create a file of bytes.
> Use a static hardcoded order-7 model.
could you explain me in a simple way how to do that ?
if you prefer, I can reduce the problem to only three possible sequences : 1
, 11 and 111. (in order to understand your explanations, then I will extend
your solution to six sequences)
I hope I am clear...
Can you give me an example about how to do that ?
> In this specific case, you only need two probability sets.
How do I create them and how do I use them ?
I never did that before...
Thank you
cp
| |
| Willem 2006-03-28, 6:55 pm |
| cp wrote:
) Hello !
)
) Thank you for having answered.
)
) The file I have to compress looks like this (extract) :
)
) ... 0000000010000010001101000000000000101000
1010111010001000010001000001010010001100
0000101001000001100000100000001000000010
10001000010010000100100...
)
) There are many sequences of "1" , much less of "11" , much less of "111"
) and so on...
)
) And there are not sequences of more than 6 consecutive 1 (eg : "111111" )
) I can read the whole file and know exactly the probability of each sequence.
) I can also create a file of bytes.
Do I understand correctly that you have one single file, and you have to
write a compressor for that ? Does the size of the compressor count ?
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
|
| hello !
> Do I understand correctly that you have one single file, and you have to
> write a compressor for that ?
of course not :))))
I wouldn't ask for a method if I had just one file to compress
I saw I wrote "The file" , but this was just a mistake from me, sorry...
I have many files and I need a method , this was just an example , but this
example is ok for all of the files !
There are no sequences of more than six consecutives "1" in any of these
files, and there are much more sequences "1" than "11" and so on...
> Does the size of the compressor count ?
no, of course, but I need a method, some explanations about what you told
if you have the time, can you explain me how do you the static order-7 (or
as an easier example the order-4 ) model ?
And how do you use the two probability sets ?
thank you in advance
cp
| |
| Willem 2006-03-28, 6:55 pm |
| cp wrote:
) I have many files and I need a method , this was just an example , but this
) example is ok for all of the files !
)
) There are no sequences of more than six consecutives "1" in any of these
) files, and there are much more sequences "1" than "11" and so on...
Maybe you should generate statistics on how many times certain length
sequences occur. Check if these statistics match the expected statistics
for independent data. An easy way to do is is to calculate the entropy
as if a sequence of ones is a single symbol and compare that with the
entropy you got for the binary sequence.
By the way, how do you know that a sequence of 7 "1"s cannot occur ?
) no, of course, but I need a method, some explanations about what you told
)
) if you have the time, can you explain me how do you the static order-7 (or
) as an easier example the order-4 ) model ?
Well, an order-7 model is a model that 'looks back' 7 symbols and
determines the statistics from that. A static model means that it is
precalculated, and hardcoded means it is programmed into the compressor.
See below.
) And how do you use the two probability sets ?
Well, when you haven't seen 6 "1"s, the probabilities are 75/25,
but when you have, the probabilities are 100/0.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
| cr88192 2006-03-28, 9:55 pm |
|
"cp" <cp@cp.fr> wrote in message
news:44298b7d$0$29212$8fcfb975@news.wanadoo.fr...
> hi !
>
> let's imagine I want to compress a text file made of bits
>
> 25% of the bits are 1, and 75% are 0.
>
> anybody knows how to compute the entropy of this file and to use an
> arithmetic compressor.
>
maybe not needed.
> but what is the new entropy of this file if I also have a condition :
> there can't be more than six consecutive "1" ?
>
> how can I compress this kind of file ?
>
> thank you for any method or idea
>
one could, instead, grab groups of a certain number of bits, and use some
fixed coding.
00 -> 0
0 -> 10
1 -> 11
000 -> 0
00 -> 10
0 -> 110
1 -> 111
or (less agressive):
000 -> 00
00 -> 01
0 -> 10
1 -> 11
so, fixed cases:
0=75%
1=25%
00=56.25
01=18.75
10=18.75
11=6.25
000=42.1875
001=14.0625
010=14.0625
011=4.6875
100=14.0625
101=4.6875
110=4.6875
111=1.5625
so, following this pattern (building a huffman tree):
\ (110 111) 6.25
\ (011 101) 9.375
\ ((110 111) (011 101)) 15.625
(001 010) 28.125
(100 ((110 111) (011 101))) 29.6875
skipping a few steps here:
(000 ((001 010) (100 ((110 111) (011 101))))) 100
thus (as one possible mapping):
000 -> 0
001 -> 100
010 -> 101
100 -> 110
110 -> 11100
111 -> 11101
011 -> 11110
101 -> 11111
minor rearange to have lexiographic order:
000 -> 0
001 -> 100
010 -> 101
100 -> 110
011 -> 11100
101 -> 11101
110 -> 11110
111 -> 11111
concievably optimal for 3 bits at a time (4 bits may be better, but I don't
feel like building the table...).
other alternatives:
grab whole bytes at a time and use huffman;
implement an order-0 bitwise arithmetic coder.
ushort min=0x0000, max=0xFFFF;
void enc_bit(int b)
{
int r;
r=max-min;
r=min+(((uint)r)*3/4); //hard coded probability
if(b)min=r+1; else max=r;
while(!((min^max)&0xFF00))
{
out_byte(min>>8);
min<<=8;
max=(max<<8)|0xFF;
}
}
can't say it will work right though, and the decoder isn't defined here...
> cp
>
| |
| David A. Scott 2006-03-29, 7:55 am |
| "cp" <cp@cp.fr> wrote in news:4429bb2c$0$20154$8fcfb975@news.wanadoo.fr:
> hello !
>
>
> of course not :))))
>
> I wouldn't ask for a method if I had just one file to compress
>
> I saw I wrote "The file" , but this was just a mistake from me,
> sorry...
>
> I have many files and I need a method , this was just an example , but
> this example is ok for all of the files !
>
> There are no sequences of more than six consecutives "1" in any of
> these files, and there are much more sequences "1" than "11" and so
> on...
>
It would not be hard to write such a compressor. But how do you know
there will never be more than 6 ones. Do you have all the files or
can more appear. Also what about the series of zeroes how many zeros
in a row, Do they follow simalare statistics like the ones?
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
| |
|
| hi !
thank you for your examples, I am going to implement them and to see the
result
I will keep you informed
bye
cp
| |
|
| hi !
> It would not be hard to write such a compressor. But how do you know
> there will never be more than 6 ones.
it is guaranteed by the provider, that's all what I can tell you
> Do you have all the files or can more appear.
more will appear
> Also what about the series of zeroes how many zeros in a row, Do they
> follow similar statistics like the ones?
one can have a maximum sequence of 20 zeroes (however the biggest I "saw"
until now was 14)
could you give me a good explanation about how to write such a compressor ?
I know I am going to try the solution suggested by cr88192 and to compare it
to the well knowns compressors (zip, bwt, arithmetic coders...) , in order
to see what I will keep.
thank you
cp
| |
| David A. Scott 2006-03-29, 9:55 pm |
| "cp" <cp@cp.fr> wrote in news:442af71b$0$21285$8fcfb975@news.wanadoo.fr:
> hi !
>
>
> it is guaranteed by the provider, that's all what I can tell you
>
>
> more will appear
>
>
> one can have a maximum sequence of 20 zeroes (however the biggest I
> "saw" until now was 14)
>
> could you give me a good explanation about how to write such a
> compressor ?
>
> I know I am going to try the solution suggested by cr88192 and to
> compare it to the well knowns compressors (zip, bwt, arithmetic
> coders...) , in order to see what I will keep.
>
> thank you
>
> cp
>
>
>
It would not be that hard to write. It would be nice if one knew more
info the more one knows the better one could write a specific compressor.
Like can the file start with either a one or zero. Does the file end
with a particular pattern and so on.
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
| |
| cr88192 2006-03-30, 3:55 am |
|
"cp" <cp@cp.fr> wrote in message
news:442aeee5$0$21273$8fcfb975@news.wanadoo.fr...
> hi !
>
> thank you for your examples, I am going to implement them and to see the
> result
>
> I will keep you informed
>
ok.
(me, meanwhile, still beating some against jpeg, imo one of the more
questionably designed formats I have seen recently...).
almost temped to give up though, not much need, and a few issues I still
haven't resolved:
every so often, a few of the huffman decoded values comes out wrong, fouling
up nearby blocks and generating a general screwup in the DC values for the
rest of the image, along with more screwups (since most of the rest of the
image seems to decode sanely, I suspect it does not involve reading more or
less bits or otherwise fouling up bitstream alignment, but may cause those
blocks to not end correctly, and will mess things up for the next few
blocks).
thought maybe it might help if I did the code assignment in a more
deflate-like manner, but this didn't work (it results in an image filled
with garbage). since most of it comes out ok, I suspect the huffman tables
are ok, but maybe there is a problem somehow in decoding the blocks
themselves?...
or such...
> bye
>
> cp
>
>
>
| |
| Willem 2006-03-30, 7:55 am |
| cp wrote:
) hi !
)
)> It would not be hard to write such a compressor. But how do you know
)> there will never be more than 6 ones.
)
) it is guaranteed by the provider, that's all what I can tell you
Okay.
Have you looked at the distribution of different-length sequences of ones ?
Is it such that when you've seen, say, three ones, that the fourth still
has a 25% chance of being a one ?
If so, here's a very simple solution:
Before encoding, replace all occurrences of "1111110" by "111111", and
after decoding, replace all occurrences of "111111" by "1111110".
)> Also what about the series of zeroes how many zeros in a row, Do they
)> follow similar statistics like the ones?
)
) one can have a maximum sequence of 20 zeroes (however the biggest I "saw"
) until now was 14)
This is also guaranteed by the provider ?
See above, but instead of six ones, read twenty zeroes.
Alternatively, if the distribution is not independent, here's another idea:
First, divide your stream into chunks, where each one is a run of ones
followed by a run of zeroes. There are 6 possible runs of 1, and 20
possible runs of 0, so you can easily put that into one byte. Then,
calculate the statistics over all the files, hardcode that into the
compressor, and use it to encode each stream of bytes.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
| Willem 2006-03-30, 7:55 am |
| Willem wrote:
) Alternatively, if the distribution is not independent, here's another idea:
) First, divide your stream into chunks, where each one is a run of ones
) followed by a run of zeroes. There are 6 possible runs of 1, and 20
) possible runs of 0, so you can easily put that into one byte. Then,
) calculate the statistics over all the files, hardcode that into the
) compressor, and use it to encode each stream of bytes.
Alternatively, you can have your encoder/decoder keep track of how many
ones or zeroes it has seen in the last run, and calculate statistics on
that. Here's a small bit of (pseudo)code to show what I mean:
for (i = 0; i < length; i++) {
total[run]++;
if (bits[i] == 1) {
ones[run]++;
if (run < 0)
run = 0;
run++;
} else { /* bits[i] == 0 */
if (run > 0)
run = 0;
run--;
}
}
for (run = MINRUN; run < MAXRUN; run++) {
printf("Probability of a '1' after %d '%d's: %f\n",
abs(run), (run < 0)?0:1, ones[run] / total[run]);
}
If all the probabilities are roughly 0.25, but after 6 '1's or
after 20 '0's, it is 0.00, then the distribution is independent.
If not, you can encode the list of probabilities over all files
in your compressor and use them as your arithcode-probabilities.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
| David A. Scott 2006-03-30, 7:55 am |
| Willem <willem@stack.nl> wrote in
news:slrne2nf2e.dgr.willem@turtle.stack.nl:
>
> If so, here's a very simple solution:
> Before encoding, replace all occurrences of "1111110" by "111111", and
> after decoding, replace all occurrences of "111111" by "1111110".
>
>
I like this idea but to flesh it out a bit more so as to be fully
bijective take any occurrence of 6 ones followed by future zeroes
and ones drop one of the zeroes and do same for 20 zeroes followed
by future ones and zeroes drop one of the ones.
The above modification handles the rare endings of 111111 versuses
1111110 since if you remove the trailing zero then nothing would
distinguish it from the case where the thing flat ended
with 6 ones. Or another way to look at do as you stated but don't
mess with the last substring at end of the main string.
At this point from what little info "cp" give it could be a
"purely random iid" string of 1's and 0's where its roughly
25% ones and 75% zeroes. You could then use something like one
of my adaptive bijective compressors except fix it at the ratio
he gave then each remaining one would use 2 bits and each
reamaining zero would use roughly .321928 bits.
This of course is based on what little info "cp" gave however
its not hard to connect roughtly 20 two state cells together and
model each length seperately or thousand of two state cells assuming
they are not independent. But the bigger the model the more the
cost of adaption it would be nice to use the correct model with as
little adaption as possible, But to do that you need all the info
about the strings to be compressed.
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
| |
|
| Hello !
Thank you to you three for your answers.
I have here an extract from a file (I added the carriage return, but I don't
know how this will be displayed when you read this post),
so you can see better what I have to compress. Can one of you give me the
possible compression ratio for this kind of file ?
I am going to test your suggestions
Thank you and have a nice day (or night)
cp
....
0101000001100000000000000001001100100000
0011001001010000001000000011001000100000
0000000100010000000010010001001000000110
00000101
0100000001100000010100100010000110000100
0000000000000101100100001010110100000000
1000010110000011000000000010000001000000
01000001
0001100010111010000000000000100000100000
0000101101001000011010000100010001000100
1011000001000000001000000010100001000010
10000000
0000001001001100010000000000100000001100
0100000000010011100100000000000110000000
1100100001110110001010000100100000000000
10001110
0000010000000011100010010001000011100110
0100101001000000100001100000010000000000
1000000000100100100000100100010100000010
10110000
0000010100011000000100000000000011001001
0000110000000001000010100100000000100000
0100100001000110011000000000101101000000
00001110
0010001000000010010000001001110000000001
0010000000010001000110000000110010100010
0001010001000010000100000000000110000001
01010000
0000100010110101000110000100011000010000
0001000000010000100100000100000000000100
0010010001110001000010000011000100001010
10000000
0100000111000100010101000000100000000100
1000010010100000000000011001000000000000
0010011110000100010000001010010001000001
10000100
1100001000001111100010000000110000001001
0000000010010100010001000100001000001001
0110111010000000000001001010010000001011
00101000
0001110000000000000000010100000100001000
0001011110010000000101001000000010000000
0011010001000111001000010000001010000100
00001000
0000010000000000100001000011100101000001
0100100000010100000011000010000000000101
0100001001011000000000111010001000000000
00110100
1001101000001000000000100000100000010010
0000011000000100000000100001000111000100
1000000000001010000100000110100010000000
10010101
0011000001000000000110001000100100000011
0000000000110110100000100100100000101011
1000100011000010000000000000000000010110
00010000
1000000001001100000010000010000000000000
0010100011010100001000101000010000000100
0101011000100000001010001000001000100000
01100001
0001000000100000010010011000100000000000
0100000000010010000000001100101011000010
1000000000100000000000010110101001000000
10000000
0000110001011101111000000000000000001000
1010000101000000010000100001001001010010
0000001010101000001010000011001000000000
00000000
1001000000000101010000100000101000101000
1010000000000000100010000000000000011001
0000011100010001000000111101000000001010
00110110
0100000000000010000010100100000010000100
1010000000001000000011000001000100000010
0010000010000010101100011010000000101100
01001010
0010000000000000110000110000000010011000
0000000100000000001110000000000010000100
0000010010001000100111110000001000000010
00000100
0000010001000001000100001011000000100010
0100100010010001000000000001000000000010
1010000101000000110100100001100100101000
00001000
1100000000101000010001000011000010000001
0000000100111001000000001100000000000000
0010100001000000000000000000010000111011
00000010
0000110010100100100100100001011001000000
0010000001000000010001010100000000100001
0000000010101010100000111000000001000000
10010100
0001010010000100000000010010100000000100
0011010001100010000000000110000100000000
1000100000000000110001000111000001010111
00000100
0010100001000100000000000010100001011100
1010010000100000000001000001010000000000
0010000100001001111100000000110000100000
00000010
0010010100000000011010001001010000001000
0001010001000100001010000000000100001100
0111000000010010000000000011000000000011
00010001
0001011000000000011000010100000000000010
0101000100001100000000100000000010111010
0000010010100001000010101001000000011110
01010001
0000001000011000001111000100000000100001
0000100000001000001001000000000010000010
0001000010011010000000110000001000110001
00000101
0101010100000100000010110000001000000000
0100001000000010001001010000000000110001
0000011000000010000000011010101000000010
00110001
0000000000000000000100100110000000100000
0010000110001010000000000010001000001110
1101000000000000011001000100000000111001
00100100
0010000000000100010001000100110100000000
0000000000011001000010001100100000101000
0011001100100000000101100010100100011100
00000000
0000000000101010000100100000011010010110
0000000100000000000000001000000001000001
0111100100100001110000001100010010111011
00000011
0100000000000100000000000101000010000000
0000000011000010100100000001000010000000
0001100000001110000000010011000001001101
00000000
0100001000000010001101000000000011001010
0101000000010000001010000001010011001100
1010010100100000000000100000100000010010
00100000
0010000000000101001000100000000010001100
0010010000001000000001100001101110000101
0101000000001000000000100000000000000100
01010000
0100011000000001010000000010010001000010
1000001010100000000001011100100100000101
0011001000001000001010000000000010010010
01110100
0000000000000100010100100000011011101010
0110000000000010000010000001000000000010
0110010010000000000010000001110000000000
10100100
1100001000100000001000001100101000110000
0000100110001010000001000010000100101100
0100000010000010000001100111011000000000
10010000
0000101000100100001000000010110001100000
0000000000000001001100110010010010000000
0100010010000010000010000000011000101100
00000001
0001001000001001010001101000000000100010
0101100000000100001000011001100000000100
0000000000100000010101101000110000000100
00000100
1000001000100000110100000100000011000010
1100000100001000000000000001001010001000
0000010000010001100000001011000100101010
01001000
0000101011111000000000100000010000000000
0010001100100000000001001000000100000011
0001000101001001000001000000000011000010
01010000
1110001000000000010101010000000010000000
1000001100111000010000100000000110000010
0000000001000010000101000000000010110100
10001000
0110010001110111000111000000010000000000
0000000011000000010011001001010100000000
1000010101000001010100001010000001000000
10000001
0011001000000000001001100011000001000100
0000100000100000001100000101100001001100
0010100001100001100010000010001010001000
00100000
0001011000001001000001001000001000000000
1000010010111000001100010000001000001000
1000110001000000100010000100100000010000
00010100
0000010100000100111000001010000000000010
1100000011010010000000010100000010110001
0000000001000010110001000100100000000110
00001000
0010110100000011000010101000100000000000
0000001010010000000110001000001100010100
1001001100000100010000000000111001000000
11000001
0000000100100000000110010011010111000000
0000000001000110010001000000010010100010
0001001010000000000001110010001000100000
00000100
0111001000000000101001010000000000100000
0010100110100010000000110000000100100101
0000100000010001010001000101000001000000
10110100
0000000000010000000110100000010100110001
0000010100001010010000000010010000000000
1110001011000101000010000000100000001100
00000001
0000000011000000101010101001110001011100
0000100100000110000001000000000001100001
1000000000110110001000001000000011000010
00101000
0000001000100000101000101010000000010000
1010010011000110010000010000000000000000
0001100110000001010001100000110010000110
00000000
0001000010010011000010000100010000000001
1010100100100000000010100000000010000010
0110010111100000010000010000000100000010
00010000
0011000001001001100000000011110011000000
0011000010001100000010010001000000001010
0000110000100001000000000101001000100101
00100000
0100010010000000010000100000011010000101
0100000010001000000000100101000010001000
0010010000100000100001010000000101010010
00000010
0000010100000010000011011001001000100000
0001111000000100000000000100000000001000
0001000000010100000001100000100100001111
11011001
0000000000000000000100101000000000000001
0100000000000000010101010000010010000011
0100100000001000000000010010011010110000
01000001
1100001000010000101000100000000101000000
0100010100000100000010000011010000000011
1001000101100100100000000000001011010001
00000000
0110000010010000011010000000010010000100
0100000110010001100001100000000110011000
0100000000001000100000010101000000000000
00001000
1001001101000000100010011000000010000001
1100000010010000110101000001000101101000
0010000010000100000000000000110010000010
00000011
0010000000000000100111010001000000000100
0000100011010011110100000001010010000000
1100000000000000001000000000100100011010
00010000
0000000110001010000100000101000001000101
0001010010001011100000001000000000000000
0000111001000000111001010010000010000001
10011010
0001010000111000010000100001100000001011
0010000000000000011100000000000011000100
1101001100000000000101000100000000000010
00101010
1000000001010101100000000000110000011110
0001100100001000001000111000000000011000
0000010000101000000011000001000001101001
00100001
0000000000100001010000010110100101001000
0000100000000000010010000000110000000000
1000010010001001000001010000010000010010
00000000
0001001010001100001001111010100000010001
1000100000001001010000000001000010110001
0010100010100000010000000000000000110000
00000000
0000111010000110101010000000000000011110
0000010010000000100001100010000000101000
0010000000100110001110100000000001110000
00010001
1000000001011100110000010000010000100000
0000100010000001110000100010000001000101
1010000100011010100000000010000010110000
00000001
0000001000000000000011111010001000010000
0000001100101000000110100000100000110100
0000011101001000010010000100011000001000
00000000
0100001000000010001000100010100100011100
1100100010001000000000000001010000010101
0100010101010001000000000001000000001000
00011000
0101000011000000110100000000010000111011
1000000000000000110000000000101100000010
0010010000001100010000100000100000011110
10000000
0000101000001101000010100010010000000000
0100100000010001000101010000010011110000
0000100010100000100100011101001000000000
00001100
1001000000000000001000000101000000100110
0001010001010100000000001000100000000000
1000001010000110000000000000010000100000
10010001
....
| |
|
| hello !
> (me, meanwhile, still beating some against jpeg, imo one of the more
> questionably designed formats I have seen recently...).
did I wrong understand or do you try to compress jpeg files ?
if you try to compress by a brute-force algorithm, they look like random
files,
so I think it will be difficult ... however, you are right to try, one can
always
discover some interesting things...
I had read a company discovered an algorithm which compressed jpg files
up to 30% of their original size, so, good luck , if this is your purpose !
> every so often, a few of the huffman decoded values comes out wrong,
this should not happen, so something is wrong... did you post your code
here ? a specialist of the jpg could probably help you...
have a nice day
cp
| |
| Willem 2006-03-30, 9:55 pm |
| cp wrote:
) Hello !
)
) Thank you to you three for your answers.
)
) I have here an extract from a file (I added the carriage return, but I don't
) know how this will be displayed when you read this post),
) so you can see better what I have to compress. Can one of you give me the
) possible compression ratio for this kind of file ?
)
) I am going to test your suggestions
I ran the following program on your file snippet, and it looks like
the distribution is independent. But it was a small sample, so the
data isn't very accurate.
Maybe you could run it on the entire set and see what it produces ?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[])
{
int i, c;
int buff[500];
int *totals = buff + 100;
int *ones = buff + 300;
int run = 0;
int minrun = 0, maxrun = 0;
FILE *f;
memset(buff, 0, sizeof(buff));
for (i = 1; i < argc; i++) {
f = fopen(argv[i], "r");
if (!f) continue;
while ((c = getc(f)) > 0) {
if (c == '1') {
totals[run]++;
ones[run]++;
if (run < 0)
run = 0;
run++;
if (run > maxrun)
maxrun = run;
} else if (c == '0') {
totals[run]++;
if (run > 0)
run = 0;
run--;
if (run < minrun)
minrun = run;
}
}
fclose(f);
}
for (run = minrun; run <= maxrun; run++) {
if (totals[run] > 0) {
printf("After a run of %d '%d's, probability is %f (%d/%d)\n",
abs(run), (run < 0)?0:1,
(float)ones[run]/(float)totals[run],
ones[run], totals[run]);
}
}
return 0;
}
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
| cr88192 2006-03-30, 9:55 pm |
|
"cp" <cp@cp.fr> wrote in message
news:442c3bd1$0$6668$8fcfb975@news.wanadoo.fr...
> hello !
>
>
> did I wrong understand or do you try to compress jpeg files ?
>
no, trying to decode into a usable image (internally, my programs tend to
use flat RGBA for everything). main project right now is a 3d engine.
also, it may be useful to understand the format, as it has some interesting
aspects:
now that the markers are understood, it is an interesting idea, albeit I am
unsure of it's in-general usefulness (typical static file formats). could
make sense for annotating streams, or resuming in the middle somewhere, ...
> if you try to compress by a brute-force algorithm, they look like random
> files,
> so I think it will be difficult ... however, you are right to try, one
> can always
> discover some interesting things...
>
> I had read a company discovered an algorithm which compressed jpg files
> up to 30% of their original size, so, good luck , if this is your purpose
> !
>
not really.
more I want to use jpegs again, because jpegs are small vs pngs, but don't
want to regain a libjpeg dependency (would rather write my own code, but,
hassles, T.81 is an annoying spec). the jpeg of an image is 50kB, the png is
700kB, the tga is 2.5MB.
don't need to encode jpegs though, useful enough for read-only uses.
png is needed (and originally replaced jpegs) because of the presence of
alpha channels. not all images need alhpa channels, and some are large
(could get a lot more benefiet from using jpeg).
then again, other aspects of the engine could use a lot more work, eg:
getting the newer script vm working. newer language is largely java-like,
but has some features, eg, variant types, lambdas, that are imo needed, and
it is generally based more around "how I do things", rather than enforcing a
strict oo mindset.
or I could go work on the physics, a lot of things here are standing out
that seem to need fixing (improving contact behavior, making "conveyor
forces" only work on the ground, fixing my friction model, ...).
and so on...
>
>
> this should not happen, so something is wrong... did you post your code
> here ? a specialist of the jpg could probably help you...
>
maybe, need to transfer it over from my laptop, maybe work on it more. can
post eventually if I can't figure it out...
I suspect it is minor, given that things largely decode ok (up until a
certain point).
> have a nice day
>
> cp
>
>
| |
|
| you could use seven ones as a special code to precode an extra symbol
in the symbol space which coded for duplicate patterns of eight bits
so 10001001 10001001
to 1111111 10001001
a grand saving of one bit!
he he
| |
| cr88192 2006-03-31, 3:55 am |
|
"jacko" <jackokring@gmail.com> wrote in message
news:1143770727.137482.159630@j33g2000cwa.googlegroups.com...
> you could use seven ones as a special code to precode an extra symbol
> in the symbol space which coded for duplicate patterns of eight bits
>
> so 10001001 10001001
> to 1111111 10001001
>
> a grand saving of one bit!
> he he
>
as for the original topic, if there is a strong probability distribution
twards 0, (the original figure of 75%,0 25%,1) then either huffman or
arithmetic coding is likely to get the best results. any gains from longer
patterns are likely to be insignificant.
| |
|
| Using arithmetic coding with 6 bit symbols (all 0-63) should yield max
compression since we have a condition that symbol cant be 111111. hence
the max probability is for 000000 = (3/4)^6 and so on.
<snip>
>
> did I wrong understand or do you try to compress jpeg files ?
>
> if you try to compress by a brute-force algorithm, they look like random
> files,
> so I think it will be difficult ... however, you are right to try, one can
> always
> discover some interesting things...
>
> I had read a company discovered an algorithm which compressed jpg files
> up to 30% of their original size, so, good luck , if this is your purpose !
>
This link may help:-
http://www.hindu.com/seta/2005/06/0...60900151500.htm
Regds,
Nishu
| |
|
|
> Maybe you could run it on the entire set and see what it produces ?
on the entire set : no !
too many files...
but I can test it on some files which have around 2 millions bits (the size
should be enough big to give us an enough good information) : I'll tell you
tomorrow
| |
|
| Hello !
I took a file from those I have, and I can give you the following
informations :
size : 2,001,803
number of 1 : 475,367 (percent : 23.747 % )
number of 0 : 1,526,437 (percent : 76.253 %)
I executed the program of Willem, it gives that :
After a run of 20 '0's, probability is 1.000000 (3/3)
After a run of 19 '0's, probability is 0.998089 (1567/1570)
After a run of 18 '0's, probability is 0.276831 (601/2171)
After a run of 17 '0's, probability is 0.287262 (875/3046)
After a run of 16 '0's, probability is 0.303453 (1327/4373)
After a run of 15 '0's, probability is 0.282880 (1725/6098)
After a run of 14 '0's, probability is 0.291754 (2512/8610)
After a run of 13 '0's, probability is 0.268417 (3159/11769)
After a run of 12 '0's, probability is 0.276066 (4488/16257)
After a run of 11 '0's, probability is 0.263790 (5825/22082)
After a run of 10 '0's, probability is 0.269268 (8137/30219)
After a run of 9 '0's, probability is 0.248825 (10010/40229)
After a run of 8 '0's, probability is 0.259843 (14123/54352)
After a run of 7 '0's, probability is 0.245310 (17667/72019)
After a run of 6 '0's, probability is 0.250411 (24059/96078)
After a run of 5 '0's, probability is 0.237555 (29935/126013)
After a run of 4 '0's, probability is 0.243615 (40586/166599)
After a run of 3 '0's, probability is 0.229340 (49578/216177)
After a run of 2 '0's, probability is 0.237334 (67272/283449)
After a run of 1 '0's, probability is 0.224110 (81872/365321)
After a run of 0 '1's, probability is 0.000000 (0/1)
After a run of 1 '1's, probability is 0.232133 (84803/365321)
After a run of 2 '1's, probability is 0.231596 (19640/84803)
After a run of 3 '1's, probability is 0.224898 (4417/19640)
After a run of 4 '1's, probability is 0.223002 (985/4417)
After a run of 5 '1's, probability is 0.203046 (200/985)
After a run of 6 '1's, probability is 0.005000 (1/200)
After a run of 7 '1's, probability is 0.000000 (0/1)
> At this point from what little info "cp" give it could be a
> "purely random iid" string of 1's and 0's where its roughly
> 25% ones and 75% zeroes. You could then use something like one
> of my adaptive bijective compressors except fix it at the ratio
> he gave then each remaining one would use 2 bits and each
> remaining zero would use roughly .321928 bits.
I would like to know if you didn't do a mistake here !
if I do what you say, I have : 475,367 x 2 = 950,734 bits
1,526,437 x 0.321928 = 491,403 bits
total : 1,442,137 bits
if the file was random, the entropy would be :
p0*log2(1/p0) + p1*log2(1/p1) = 0.79 bits per char =
0.79 x 2,001,803 = 1,583,041 bits
are you sure about what you said ?
where can I donwload your adaptive bijective compressor which gives such a
good ratio ?
have a nice day
cp
| |
|
| Hi !
As I told you yesterday, I took a file from those I have, and I can give you
the following informations :
size : 2,001,803 bits
number of 1 : 475,367 (percent : 23.747 % )
number of 0 : 1,526,437 (percent : 76.253 %)
I executed your program , it gives that :
After a run of 20 '0's, probability is 1.000000 (3/3)
After a run of 19 '0's, probability is 0.998089 (1567/1570)
After a run of 18 '0's, probability is 0.276831 (601/2171)
After a run of 17 '0's, probability is 0.287262 (875/3046)
After a run of 16 '0's, probability is 0.303453 (1327/4373)
After a run of 15 '0's, probability is 0.282880 (1725/6098)
After a run of 14 '0's, probability is 0.291754 (2512/8610)
After a run of 13 '0's, probability is 0.268417 (3159/11769)
After a run of 12 '0's, probability is 0.276066 (4488/16257)
After a run of 11 '0's, probability is 0.263790 (5825/22082)
After a run of 10 '0's, probability is 0.269268 (8137/30219)
After a run of 9 '0's, probability is 0.248825 (10010/40229)
After a run of 8 '0's, probability is 0.259843 (14123/54352)
After a run of 7 '0's, probability is 0.245310 (17667/72019)
After a run of 6 '0's, probability is 0.250411 (24059/96078)
After a run of 5 '0's, probability is 0.237555 (29935/126013)
After a run of 4 '0's, probability is 0.243615 (40586/166599)
After a run of 3 '0's, probability is 0.229340 (49578/216177)
After a run of 2 '0's, probability is 0.237334 (67272/283449)
After a run of 1 '0's, probability is 0.224110 (81872/365321)
After a run of 0 '1's, probability is 0.000000 (0/1)
After a run of 1 '1's, probability is 0.232133 (84803/365321)
After a run of 2 '1's, probability is 0.231596 (19640/84803)
After a run of 3 '1's, probability is 0.224898 (4417/19640)
After a run of 4 '1's, probability is 0.223002 (985/4417)
After a run of 5 '1's, probability is 0.203046 (200/985)
After a run of 6 '1's, probability is 0.005000 (1/200)
After a run of 7 '1's, probability is 0.000000 (0/1)
have a nice day
cp
| |
|
| > where can I donwload your adaptive bijective compressor which gives such a
> good ratio ?
I just want to check what you said, because I don't believe it !
But I can be wrong...
However, you made me curious !
Can you post the URL with the program which does that ?
thank you
cp
| |
| Willem 2006-04-01, 7:55 am |
| cp wrote:
) Hi !
)
) As I told you yesterday, I took a file from those I have, and I can give you
) the following informations :
)
) size : 2,001,803 bits
) number of 1 : 475,367 (percent : 23.747 % )
) number of 0 : 1,526,437 (percent : 76.253 %)
)
) I executed your program , it gives that :
)
) After a run of 20 '0's, probability is 1.000000 (3/3)
) After a run of 19 '0's, probability is 0.998089 (1567/1570)
) ...
) After a run of 6 '1's, probability is 0.005000 (1/200)
) After a run of 7 '1's, probability is 0.000000 (0/1)
)
) have a nice day
Well, there seems to be a noticeable correlation with runlegnths.
Furthermore, I notice that the program has seen a run of 7 ones,
and also only three runs of 20 zeroes.
If you use those statistics, you should get it to 1578542 bits, if
I calculated correctly, that is. (I get 1583040 with single statistics).
Less than 0.5% saved (4500 bits), don't know if that's enough for your
purposes. Bit it's a lot more than the 200 bits you save from looking
only at runs of six ones.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
| David A. Scott 2006-04-01, 6:55 pm |
| "cp" <cp@cp.fr> wrote in news:442e5395$0$21279$8fcfb975@news.wanadoo.fr:
>
> I just want to check what you said, because I don't believe it !
>
> But I can be wrong...
>
> However, you made me curious !
>
> Can you post the URL with the program which does that ?
>
> thank you
>
> cp
>
>
The basic 2 state bijective compress is at
http://bijective.dogma.net/arb2x.zip
it works on a file as if its a long string and
is adaptive and will compress bases on the ratio
of ones to zeroes.
http://bijective.dogma.net/a01tb01.zip
converts any length ascii string of "1" "0" to
a file in a bijective way. It would be wise to
use it to compress to a file and then use arb2x
Once you comvert string to basic file and compress
you can convert make to string and see the change
the string will be longer since the code is adaptive
but not much. If you use fixed ratios it will be
slightly shorter. If you connect seceral cells
together do it simalar as find in
arb255.zip which is the best adaptvie bijective 256
state symbol compressor for iid sources that I now of.
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
| |
|
| are you sure about that ?
> http://bijective.dogma.net/a01tb01.zip
Not Found
The requested URL /a01tb01.zip was not found on this server.
Additionally, a 404 Not Found error was encountered while trying to use an
ErrorDocument to handle the request.
| |
| David A. Scott 2006-04-01, 6:55 pm |
| "cp" <cp@cp.fr> wrote in news:442ea5b4$0$29203$8fcfb975@news.wanadoo.fr:
> are you sure about that ?
>
>
>
> Not Found
> The requested URL /a01tb01.zip was not found on this server.
>
> Additionally, a 404 Not Found error was encountered while trying to
> use an ErrorDocument to handle the request.
>
>
>
while I am not perfect zeroes and Ohs eyes and ones are often
switched it makes finding errors hard. ALso as you notice my english
sucks but check out whhoops it was nasty the tea for two air
http://bijective.dogma.net/compres11.htm
in the page are links to the softwars. It should work!
or try
http://bijective.dogma.net/a012b01.zip
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
| |
|
| hello !
it's ok now
but as I expected, these are not the results you told :
>each remaining one would use 2 bits and each
>remaining zero would use roughly .321928 bits.
it doesn't matter, it's under the entropy, but how did you
find these results ?
cp
| |
| David A. Scott 2006-04-02, 6:55 pm |
| "cp" <cp@cp.fr> wrote in news:442eb9f5$0$18309$8fcfb975@news.wanadoo.fr:
> hello !
>
> it's ok now
>
> but as I expected, these are not the results you told :
>
>
> it doesn't matter, it's under the entropy, but how did you
> find these results ?
>
>
> cp
>
>
>
First of all I stated you had to change my code to use the
fixed rates that you stated existed. The code of mine as it
stands is adaptive and would not give the results it should unless
you edited the code.
If the ones appear 25% of the time and the code was hard coded
for that you would use 2 bits per one or log(4)
the zero if it occurs 75% of the time it would appear log(4/3)
roughly 0.415037499278843818546261056052183
not sure where the .3219.. came from it was most likely an error
nay not of had coffee yet.
Note of the the symbol occurs with probabiliy P then the
log(1/P) is the number of bits use log for log base 2.
Or some people use - log(P) so take your pick.
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
|
|
|
|
|