Code Comments
Programming Forum and web based access to our favorite programming groups.Hello!! I have an arithmetic compressor that, at initialization, allow to me to set the number of symbols to compress. This is a port to c# that i have writted of c code of Mark Nelson. I've maked a lot of tests an work fine. My question is: When i take a lot of bytes (say, 100000) and push it into the compressor, i get a compress ratio (say 50%), but if i push the same bytes bit by bit i get more compression (say 60%). Is this ok?. If yes, are there another way to determine the best symbol size that try and error? Thanks a lot. Federico
Post Follow-up to this messageOn 22 Oct 2004 09:12:34 -0700, Federico wrote: > My question is: When i take a lot of bytes (say, 100000) and push it > into the compressor, i get a compress ratio (say 50%), but if i push > the same bytes bit by bit i get more compression (say 60%). What do you mean by "pushing into the compressor"? > Is this ok?. If yes, are there another way to determine the best > symbol size that try and error? No, this is odd. If you use the same model on the same files, you should be getting the same ratio IMHO. Eric -- Eric Bodden, ICQ: 12656220, http://www.bodden.de, PGP: BB465582 Arithmetic Coding - educational example code and more http://ac.bodden.de/
Post Follow-up to this message-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Eric Bodden <newsserver_mails@bodden.de> wrote: > On 22 Oct 2004 09:12:34 -0700, Federico wrote: > What do you mean by "pushing into the compressor"? > No, this is odd. If you use the same model on the same files, you should b e > getting the same ratio IMHO. > Eric Why odd? When we use the same model on the same "random source", we should get the same ratio. But now it is a specific file, just a sampling from a random source. Thus we can't acquire any knowledge about the original ramdon source. For an example. A file contains 0x01 0x02 0x04 0x08 0x10 can be a sample from a source: 0x01 probability = 0.2 0x02 probability = 0.2 0x04 probability = 0.2 0x08 probability = 0.2 0x10 probability = 0.2 we can't compress this source because it's an uniform distribution. And also it could be a sample from another source: 0 probability = 35/40 1 probability = 5/40 which can be compressed well. - -- PaulLiu(¼B¿oÂ@) E-mail address:PaulLiu.bbs@bbs.cis.nctu.edu.tw -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) iD8DBQFBej/ roQj7xTSiaUYRAqV9AJoDv7dKQ3bQ5MusdAXppWO zd1FWlgCbBl2r S23KZZ5/YG0zpTz1dEKYgNQ= =yl6E -----END PGP SIGNATURE-----
Post Follow-up to this messageYing-Chun Liu <PaulLiu.bbs@bbs.cis.nctu.edu.tw> wrote in message news:<cldfa r$hc3$1@jupiter.ttn.net>... > For an example. A file contains 0x01 0x02 0x04 0x08 0x10 > can be a sample from a source: > > 0x01 probability = 0.2 > 0x02 probability = 0.2 > 0x04 probability = 0.2 > 0x08 probability = 0.2 > 0x10 probability = 0.2 > > we can't compress this source because it's an uniform distribution. > > And also it could be a sample from another source: > > 0 probability = 35/40 > 1 probability = 5/40 > > which can be compressed well. Good example, but it's not only applicable to arithmetic encoders. I actually took advantage of something similar to that when I wrote a VLSI sim dump database: when bitvectors are blasted apart in the value "string" table, values would compress much better as you could compress bit trains that were shifted off of byte boundaries and wouldn't have been all that compressible otherwise: let x be some long stream of bits, then we're presented with x in one instance and prefix bits + x + suffix bits in another, etc. ...this worked especially well with LZ/BWT compressor postprocessors as the data itself isn't really character-based data and the frequent (sub)string matches were able to overcome the 8x expansion from packed characters. As always, YMMV. -t
Post Follow-up to this messageYing-Chun Liu <PaulLiu.bbs@bbs.cis.nctu.edu.tw> writes: > Eric Bodden <newsserver_mails@bodden.de> wrote: > > Why odd? When we use the same model on the same "random source", we > should get the same ratio. But now it is a specific file, just > a sampling from a random source. Thus we can't acquire any knowledge > about the original ramdon source. > > For an example. A file contains 0x01 0x02 0x04 0x08 0x10 > can be a sample from a source: > > 0x01 probability = 0.2 > 0x02 probability = 0.2 > 0x04 probability = 0.2 > 0x08 probability = 0.2 > 0x10 probability = 0.2 > > we can't compress this source because it's an uniform distribution. Nonsense. P(0x00) = 0.0 P(0x03) = 0.0 P(0x05) = 0.0 P(0x06) = 0.0 ... Phil -- They no longer do my traditional winks tournament lunch - liver and bacon. It's just what you need during a winks tournament lunchtime to replace lost ... liver. -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'
Post Follow-up to this message-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Phil Carmody <thefatphil_demunged@yahoo.co.uk> wrote: > Ying-Chun Liu <PaulLiu.bbs@bbs.cis.nctu.edu.tw> writes: > Nonsense. > P(0x00) = 0.0 > P(0x03) = 0.0 > P(0x05) = 0.0 > P(0x06) = 0.0 Symbol has no meaning. What I want to say is when you choose 0x01 0x02 0x04 0x08 0x10 as symbols or 0 1 as symbols to represent the same data. And calculate the entropy function H(S) results different value. H(S) = £U -p(s)log p(s) s - -- PaulLiu(¼B¿oÂ@) E-mail address:PaulLiu.bbs@bbs.cis.nctu.edu.tw -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) iD8DBQFBe7NooQj7xTSiaUYRAgOkAJwPzeWSduCq QLbjpiZanIYnwPedtwCfVcNO nAjDyp1QNZddtooEyb+9Vm0= =Q6m8 -----END PGP SIGNATURE-----
Post Follow-up to this message
Show a Printable Version
Email This Page to Someone!
Receive updates to this thread
Powered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.