Home > Archive > Compression > October 2004 > Arithmetic compression
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Arithmetic compression
|
|
| Federico 2004-10-22, 3:55 pm |
| Hello!!
I have an arithmetic compressor that, at initialization, allow to me
to set the number of symbols to compress. This is a port to c# that i
have writted of c code of Mark Nelson. I've maked a lot of tests an
work fine.
My question is: When i take a lot of bytes (say, 100000) and push it
into the compressor, i get a compress ratio (say 50%), but if i push
the same bytes bit by bit i get more compression (say 60%).
Is this ok?. If yes, are there another way to determine the best
symbol size that try and error?
Thanks a lot.
Federico
| |
| Eric Bodden 2004-10-22, 3:55 pm |
| On 22 Oct 2004 09:12:34 -0700, Federico wrote:
> My question is: When i take a lot of bytes (say, 100000) and push it
> into the compressor, i get a compress ratio (say 50%), but if i push
> the same bytes bit by bit i get more compression (say 60%).
What do you mean by "pushing into the compressor"?
> Is this ok?. If yes, are there another way to determine the best
> symbol size that try and error?
No, this is odd. If you use the same model on the same files, you should be
getting the same ratio IMHO.
Eric
--
Eric Bodden, ICQ: 12656220, http://www.bodden.de, PGP: BB465582
Arithmetic Coding - educational example code and more
http://ac.bodden.de/
| |
| Ying-Chun Liu 2004-10-23, 8:55 am |
| -----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Eric Bodden <newsserver_mails@bodden.de> wrote:
> On 22 Oct 2004 09:12:34 -0700, Federico wrote:
> What do you mean by "pushing into the compressor"?
> No, this is odd. If you use the same model on the same files, you should be
> getting the same ratio IMHO.
> Eric
Why odd? When we use the same model on the same "random source", we
should get the same ratio. But now it is a specific file, just
a sampling from a random source. Thus we can't acquire any knowledge
about the original ramdon source.
For an example. A file contains 0x01 0x02 0x04 0x08 0x10
can be a sample from a source:
0x01 probability = 0.2
0x02 probability = 0.2
0x04 probability = 0.2
0x08 probability = 0.2
0x10 probability = 0.2
we can't compress this source because it's an uniform distribution.
And also it could be a sample from another source:
0 probability = 35/40
1 probability = 5/40
which can be compressed well.
- --
PaulLiu(¼B¿oÂ@)
E-mail address:PaulLiu.bbs@bbs.cis.nctu.edu.tw
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
iD8DBQFBej/ roQj7xTSiaUYRAqV9AJoDv7dKQ3bQ5MusdAXppWO
zd1FWlgCbBl2r
S23KZZ5/YG0zpTz1dEKYgNQ=
=yl6E
-----END PGP SIGNATURE-----
| |
| Anthony J Bybell 2004-10-23, 8:55 pm |
| Ying-Chun Liu <PaulLiu.bbs@bbs.cis.nctu.edu.tw> wrote in message news:<cldfar$hc3$1@jupiter.ttn.net>...
> For an example. A file contains 0x01 0x02 0x04 0x08 0x10
> can be a sample from a source:
>
> 0x01 probability = 0.2
> 0x02 probability = 0.2
> 0x04 probability = 0.2
> 0x08 probability = 0.2
> 0x10 probability = 0.2
>
> we can't compress this source because it's an uniform distribution.
>
> And also it could be a sample from another source:
>
> 0 probability = 35/40
> 1 probability = 5/40
>
> which can be compressed well.
Good example, but it's not only applicable to arithmetic encoders. I
actually took advantage of something similar to that when I wrote a
VLSI sim dump database: when bitvectors are blasted apart in the value
"string" table, values would compress much better as you could
compress bit trains that were shifted off of byte boundaries and
wouldn't have been all that compressible otherwise:
let x be some long stream of bits, then we're presented with
x in one instance
and prefix bits + x + suffix bits in another, etc.
....this worked especially well with LZ/BWT compressor postprocessors
as the data itself isn't really character-based data and the frequent
(sub)string matches were able to overcome the 8x expansion from packed
characters. As always, YMMV.
-t
| |
| Phil Carmody 2004-10-24, 8:55 am |
| Ying-Chun Liu <PaulLiu.bbs@bbs.cis.nctu.edu.tw> writes:
> Eric Bodden <newsserver_mails@bodden.de> wrote:
>
> Why odd? When we use the same model on the same "random source", we
> should get the same ratio. But now it is a specific file, just
> a sampling from a random source. Thus we can't acquire any knowledge
> about the original ramdon source.
>
> For an example. A file contains 0x01 0x02 0x04 0x08 0x10
> can be a sample from a source:
>
> 0x01 probability = 0.2
> 0x02 probability = 0.2
> 0x04 probability = 0.2
> 0x08 probability = 0.2
> 0x10 probability = 0.2
>
> we can't compress this source because it's an uniform distribution.
Nonsense.
P(0x00) = 0.0
P(0x03) = 0.0
P(0x05) = 0.0
P(0x06) = 0.0
....
Phil
--
They no longer do my traditional winks tournament lunch - liver and bacon.
It's just what you need during a winks tournament lunchtime to replace lost
.... liver. -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'
| |
| Ying-Chun Liu 2004-10-24, 3:55 pm |
| -----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote:
> Ying-Chun Liu <PaulLiu.bbs@bbs.cis.nctu.edu.tw> writes:
> Nonsense.
> P(0x00) = 0.0
> P(0x03) = 0.0
> P(0x05) = 0.0
> P(0x06) = 0.0
Symbol has no meaning.
What I want to say is when you choose 0x01 0x02 0x04 0x08 0x10 as symbols
or 0 1 as symbols to represent the same data.
And calculate the entropy function H(S) results different value.
H(S) = £U -p(s)log p(s)
s
- --
PaulLiu(¼B¿oÂ@)
E-mail address:PaulLiu.bbs@bbs.cis.nctu.edu.tw
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
iD8DBQFBe7NooQj7xTSiaUYRAgOkAJwPzeWSduCq
QLbjpiZanIYnwPedtwCfVcNO
nAjDyp1QNZddtooEyb+9Vm0=
=Q6m8
-----END PGP SIGNATURE-----
|
|
|
|
|