For Programmers: Free Programming Magazines  


Home > Archive > Compression > June 2007 > Open source java implementations of PPM?









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Open source java implementations of PPM?
digiology@gmail.com

2007-05-07, 9:55 pm

Hi, Im looking a lightweight implementation of PPM or some variant
that I can use to test the potential of integrating it with my own
system. I've sort of found what Im looking for here:
http://www.colloquial.com/ArithmeticCoding/ but its not quite what I
need. Its a huge package and seems like overkill for what I need.
It needs to be able to compression short messages (in english) to
roughly 30-40% its original size with at most 128Kb of memory ( 32Kb
or 64Kb would be desirable) so paramters such as the order of the
model would of course need to be configurable. Ideally the result
would be returned as a byte array.

I don't need all of these requirements but I need something that can
easily be modified to suit. This is just quicky for a protoype : )

Mark Adler

2007-05-07, 9:55 pm

On May 7, 5:04 pm, digiol...@gmail.com wrote:
> Hi, Im looking a lightweight implementation of PPM


The words "lightweight" and "PPM" don't generally go together, at
least not with respect to the resources required.

> It needs to be able to compression short messages (in english)


For a compressor to be good at all at compressing short messages, it
needs to be preloaded with some sort of relevant context for the
messages expected. Otherwise pretty much any compressor will do
relatively poorly on short messages, as compared to a large number of
such messages compressed together. One way to build the context is to
consider the sequence of short messages as one long message that's
being compressed. The downside there is that if you lose the context
on the other end, then you can't decompress any subsequent messages.

> roughly 30-40% its original size with at most 128Kb of memory ( 32Kb
> or 64Kb would be desirable)


That's not really the regime that PPM thrives in. That's the regime
for LZish compressors.

Mark

Tedtard

2007-05-08, 2:57 am

Martha Stewart in anal action!
http://Martha-Stewart-anal-action.o...hp?movie=148803
digiology@gmail.com

2007-05-09, 3:56 am

Hi Mark, this actually has been done before using a variant PPMC at
orders 2 and 3 (partial context I would say) with only 32kb of memory.
The application would of course be preloaded with a context model
which would be present at both receiving and sending ends. It acheives
roughly the same compression ratio as im aiming for, unfortunately the
code isnt available to me. H
However, here is a paper outlining the approach used, its a little
over my head at the moment, probably too much for a one man job:
http://www.academypublisher.com/jcp...jcp01060110.pdf
Just out of interest, how well do LZ compressors do on short english
strings (possibly 70 characters long)?


Thanks for any help you can offer, Im really hoping to avoid having to
implement this.

Sachin Garg

2007-05-09, 3:56 am


On May 9, 1:32 am, digiol...@gmail.com wrote:
> Hi Mark, this actually has been done before using a variant PPMC at
> orders 2 and 3 (partial context I would say) with only 32kb of memory.
> The application would of course be preloaded with a context model
> which would be present at both receiving and sending ends. It acheives
> roughly the same compression ratio as im aiming for, unfortunately the
> code isnt available to me. H
> However, here is a paper outlining the approach used, its a little
> over my head at the moment, probably too much for a one man job:http://www.academypublisher.com/jcp...jcp01060110.pdf
> Just out of interest, how well do LZ compressors do on short english
> strings (possibly 70 characters long)?
>
> Thanks for any help you can offer, Im really hoping to avoid having to
> implement this.


For a pre-loaded model you might want a pre-allocated model. The
escape code and other complications added by PPMC etc help the model
grow dynamically but that doesn't makes much sense for a preloaded
model.

There is no free lunch here, but I once wrote a simple order-0
arithmetic coder in Java, it can be very easily extended for using
higher order statically allocated models or for model pre-loading.

http://sachingarg.com/compression/e...ava_range_coder

If you do want to implement a dynamically growing model like PPMC
(where the entire tree is not initially allocated and then escape
codes are used to grow the tree over time), then the source you
mentioned in your first post is the best bet.

Sachin Garg [India]
www.sachingarg.com | www.c10n.info

ps. in case you are wondering, range coder is just a faster alternate
to arithmetic coder, both are based on same concepts.

digiology@gmail.com

2007-05-09, 3:56 am

Hi, thanks for the reply. I'll try looking at your implementation and
see how I can go about extending it to order 2 or 3, I'll probably be
back with lots of questions if I do : ) Im just beginning to explore
compression. Im also glad you brought range coder to my attention,
I've been looking for something like this.



Cheers

Mark Adler

2007-05-09, 3:56 am

On May 8, 1:32 pm, digiol...@gmail.com wrote:
> Just out of interest, how well do LZ compressors do on short english
> strings (possibly 70 characters long)?


Very poorly, again unless some context is pre-loaded that has some
matches for the short strings.

Mark

John Reiser

2007-05-09, 6:56 pm

Mark Adler wrote:

>
>
> Very poorly, again unless some context is pre-loaded that has some
> matches for the short strings.


I suggest evaluating actual measurements on your inputs.
You might be favorably impressed by something such as lzma,
which uses deflate plus a range encoder.
Ratios (output/input) of 0.4 won't be common, but 0.6 could be.

--
cr88192

2007-05-10, 3:56 am


<digiology@gmail.com> wrote in message
news:1178582659.879979.57900@e65g2000hsc.googlegroups.com...
> Hi, Im looking a lightweight implementation of PPM or some variant
> that I can use to test the potential of integrating it with my own
> system. I've sort of found what Im looking for here:
> http://www.colloquial.com/ArithmeticCoding/ but its not quite what I
> need. Its a huge package and seems like overkill for what I need.
> It needs to be able to compression short messages (in english) to
> roughly 30-40% its original size with at most 128Kb of memory ( 32Kb
> or 64Kb would be desirable) so paramters such as the order of the
> model would of course need to be configurable. Ideally the result
> would be returned as a byte array.
>
> I don't need all of these requirements but I need something that can
> easily be modified to suit. This is just quicky for a protoype : )
>


you know, PPM and Java, are almost the furthest thing from "lightweight" you
can get.

getting much of anything written in Java to run in that little memory is a
stretch, much less *PPM* of all things.

or at least this is my impression...



digiology@gmail.com

2007-05-11, 3:56 am

Yeah, there's no doubt it'd be a challenge, but not impossible. I
think I could implement a scaled down PPM implementation within those
memory limits its just the arithmetic/range coding and dealing with
variable length codes that gives me the willys *shudders*. Hopefully
Sachin's order-0 implementation can be easily extended.

cr88192

2007-05-11, 3:56 am


<digiology@gmail.com> wrote in message
news:1178841984.605730.68210@y80g2000hsf.googlegroups.com...
> Yeah, there's no doubt it'd be a challenge, but not impossible. I
> think I could implement a scaled down PPM implementation within those
> memory limits its just the arithmetic/range coding and dealing with
> variable length codes that gives me the willys *shudders*. Hopefully
> Sachin's order-0 implementation can be easily extended.
>


but, the bigger problem is not *just* the PPM, but, Java...

if no one has noticed, Java, even for trivial things, tends to use teh-huge
amounts of memory.

partial reason:
Java likes dynamically allocating things, and using a GC;
though good for convinience, ... these things are bad for memory footprint.

even in C, one has to use very specific representations of data (and
possibly custom memory management) to keep the footprint low.

this is, for the most part, beyond the capabilities of Java.


however, naive code will have a bad footprint in either case (dunno if the
typical JVM MM is more or less space-efficient than a typical C malloc
implementation).

I am speaking more specifically of specialized approaches, for example
slabs, special space saving tricks (for example, using the same slot for
different things in different situations, packing multiple values per
variable, ...), ...


someone may disagree with me, I don't know...



digiology@gmail.com

2007-05-11, 6:55 pm

Your right, you do have to be careful with what Java is doing behind
the scenes, if you know your stuff and are careful for example not
declaring variables in loops, not using + for string concatenation and
staying away from Vectors, Strings ect for convenience, these can all
help.
Im using Java because I have no other choice for this application (its
in J2ME to be specific).

cr88192

2007-05-11, 6:55 pm


<digiology@gmail.com> wrote in message
news:1178886760.642489.310000@w5g2000hsg.googlegroups.com...
> Your right, you do have to be careful with what Java is doing behind
> the scenes, if you know your stuff and are careful for example not
> declaring variables in loops, not using + for string concatenation and
> staying away from Vectors, Strings ect for convenience, these can all
> help.
> Im using Java because I have no other choice for this application (its
> in J2ME to be specific).
>


well, be careful not to use classes, or at least be very careful with
classes, as well (each instance can easily use tens to hundreds of bytes,
with some indeterminant amount of MM overhead).

maybe keep as much state as possible in single large arrays (keeping down
the actual number of such arrays).

....


as for strings, it doesn't help that java uses UTF-16...

in my projects, internally, typically I use UTF-8 (or often just ASCII).

IMO, for most things, UTF-8 would seem to be a better option than UTF-16
(only converting to UTF-16 for some specific kinds of operations, which IME
haven't really occured in practice).



digiology@gmail.com

2007-05-11, 9:55 pm

Yeah, I recently noticed this, I had 2000 strings stored in a vector
(because it was convenient to access that way), using a single
character array really brought down the memory usage.
Im not sure how to change Java's internal character representation but
I'd probably use bytes with my own mapping from characters to byte
values if necessary, I think char primitives are 16 bit by default. If
UTF-8 does the same job without me having to worry about it then I'll
try that when I figure it out : )

cr88192

2007-05-11, 9:55 pm


<digiology@gmail.com> wrote in message
news:1178933658.406740.50020@w5g2000hsg.googlegroups.com...
> Yeah, I recently noticed this, I had 2000 strings stored in a vector
> (because it was convenient to access that way), using a single
> character array really brought down the memory usage.
> Im not sure how to change Java's internal character representation but
> I'd probably use bytes with my own mapping from characters to byte
> values if necessary, I think char primitives are 16 bit by default. If
> UTF-8 does the same job without me having to worry about it then I'll
> try that when I figure it out : )
>


well, UTF-8 represents the same values as UTF-16 (more or less, what is used
in Java).

however, note that different characters will take a different number of
bytes to represent. for data compression, even if for text, it may make more
sense to view things in terms of strings of raw-bytes (rather than
'characters').


oh well, I use UTF-8, but in my code support is chaotic (at best) as some
code will support UTF-8, and other code will not (treating it like raw
ascii). one issue is that, normally, C represents strings as "char *", but
for UTF-8, I need "uchar *" or "byte *". this leads to extra variables and
casts in many places.

different code does different things in different places though...



Sachin Garg

2007-05-12, 9:56 pm

On May 12, 7:47 am, "cr88192" <cr88...@NOSPAM.hotmail.com> wrote:
> <digiol...@gmail.com> wrote in message
>
> news:1178933658.406740.50020@w5g2000hsg.googlegroups.com...
>
>
> well, UTF-8 represents the same values as UTF-16 (more or less, what is used
> in Java).
>
> however, note that different characters will take a different number of
> bytes to represent. for data compression, even if for text, it may make more
> sense to view things in terms of strings of raw-bytes (rather than
> 'characters').
>
> oh well, I use UTF-8, but in my code support is chaotic (at best) as some
> code will support UTF-8, and other code will not (treating it like raw
> ascii). one issue is that, normally, C represents strings as "char *", but
> for UTF-8, I need "uchar *" or "byte *". this leads to extra variables and
> casts in many places.
>
> different code does different things in different places though...


The thing I dislike most about Java is its lack of unsigned integers.
Almost all algorithm's implementation needs unsigned maths and I
haven't yet found any reasonable explanation on why this was decided.

Sachin Garg [India]
www.sachingarg.com | www.c10n.info

cr88192

2007-05-12, 9:56 pm


"Sachin Garg" <schngrg@gmail.com> wrote in message
news:1178977319.040470.318200@w5g2000hsg.googlegroups.com...
> On May 12, 7:47 am, "cr88192" <cr88...@NOSPAM.hotmail.com> wrote:

<snip>
[color=darkred]
>
> The thing I dislike most about Java is its lack of unsigned integers.
> Almost all algorithm's implementation needs unsigned maths and I
> haven't yet found any reasonable explanation on why this was decided.
>


can't say myself, I don't know.


> Sachin Garg [India]
> www.sachingarg.com | www.c10n.info
>



Matt Mahoney

2007-05-12, 9:56 pm

On May 12, 9:41 am, Sachin Garg <schn...@gmail.com> wrote:
> The thing I dislike most about Java is its lack of unsigned integers.
> Almost all algorithm's implementation needs unsigned maths and I
> haven't yet found any reasonable explanation on why this was decided.


Probably because you can do everything with signed numbers and the >>>
unsigned right shift operator. In C++, unsigned numbers are a pain
when expressions like (s.size() > -1) don't work the way you would
expect because -1 is silently converted to a huge number.

-- Matt Mahoney


cr88192

2007-05-12, 9:56 pm


"Matt Mahoney" <matmahoney@yahoo.com> wrote in message
news:1179005378.487360.194140@n59g2000hsh.googlegroups.com...
> On May 12, 9:41 am, Sachin Garg <schn...@gmail.com> wrote:
>
> Probably because you can do everything with signed numbers and the >>>
> unsigned right shift operator. In C++, unsigned numbers are a pain
> when expressions like (s.size() > -1) don't work the way you would
> expect because -1 is silently converted to a huge number.
>


many cases, signed numbers make sense, some cases, unsigned would help a
lot. then again, one can always take a smaller signed number and mask it to
get a larger unsigned one, but oh well.


I am just surprised no one debated with me about memory footprint issue,
trying to claim Java had a smaller footprint than C or something, but then
again, I don't think many are much doubting me on this one...


> -- Matt Mahoney
>
>



Sachin Garg

2007-05-13, 7:55 am


On May 13, 3:01 am, "cr88192" <cr88...@NOSPAM.hotmail.com> wrote:
> "Matt Mahoney" <matmaho...@yahoo.com> wrote in message
>
> news:1179005378.487360.194140@n59g2000hsh.googlegroups.com...
>
>

Yes, this does makes sense, but its just that with so much legacy code
using such issues as features, it becomes a pain to port it :-)

There will still be 1-bit of lost precision.
[color=darkred]
> many cases, signed numbers make sense, some cases, unsigned would help a
> lot. then again, one can always take a smaller signed number and mask it to
> get a larger unsigned one, but oh well.


That is what I have done to port range-coder to Java, use 64-bit
variables. You can take a quick look at code in case there is a way to
improve it. With >>> it might be possible to fit it back into 32-bits.

64-bit Java implementation:
http://www.sachingarg.com/compressi...eEncoder64.java

32-bit C++ implementation:
http://www.sachingarg.com/compressi...opy/range32.cpp

64-bit C++ implementation:
http://www.sachingarg.com/compressi...opy/range64.cpp

> I am just surprised no one debated with me about memory footprint issue,
> trying to claim Java had a smaller footprint than C or something, but then
> again, I don't think many are much doubting me on this one...


Yep :-)

cr88192

2007-05-13, 7:55 am


"Sachin Garg" <schngrg@gmail.com> wrote in message
news:1179049171.484229.106660@p77g2000hsh.googlegroups.com...
>
> On May 13, 3:01 am, "cr88192" <cr88...@NOSPAM.hotmail.com> wrote:
>
> Yes, this does makes sense, but its just that with so much legacy code
> using such issues as features, it becomes a pain to port it :-)
>
> There will still be 1-bit of lost precision.
>


yes.


>
> That is what I have done to port range-coder to Java, use 64-bit
> variables. You can take a quick look at code in case there is a way to
> improve it. With >>> it might be possible to fit it back into 32-bits.
>
> 64-bit Java implementation:
>
> http://www.sachingarg.com/compressi...eEncoder64.java
>
> 32-bit C++ implementation:
>
> http://www.sachingarg.com/compressi...opy/range32.cpp
>
> 64-bit C++ implementation:
>
> http://www.sachingarg.com/compressi...opy/range64.cpp
>


though that 64 bit math is usually slower.
may not look into this much right now.


>
> Yep :-)
>


yeah.



Bulat Ziganshin

2007-05-23, 6:56 pm

On 8 อมส, 04:04, digiol...@gmail.com wrote:
> Hi, Im looking a lightweight implementation of PPM or some variant


> It needs to be able to compression short messages (in english) to
> roughly 30-40% its original size with at most 128Kb of memory ( 32Kb


there is one rather small known compressor (written by me) -
http://www.haskell.org/bz/cm.zip

it is the only static PPM compressor ever developed. what it does:
counts all order-N contexts, then omits rare ones and encode the rest
very like the idea of encoding huffman tree in deflate. then seond
pass runs where data encoded using this PPM statistics

this code used in several compression projects what need random
access. you do something like this - either save one PPM tree for many
messages or use one fixed tree for all messages


approximately, if you use 64kb for final tree, then full tree (before
cutting rare contexts) may be ~10x larger and in compressed form this
tree should ~10x smaller. so, one possible solution is to build tree
on large set of typical data using ~1mb of memory and order 3 or 4,
cut it off down to 64kb and save in some form - all these on PC

then on your device you can put this tree to memory and use it to
encode/decode strings. because you already cutted of rare cases
counted using big amount of memory, efficiency of such encoding will
be close to using 512k of memory with regular PPM algorithm

i also recommend you to use something like http://www.compression.ru/ds/ppmdj1.rar
which uses memory more efficiently than my algorithm. finally you
should get performance close to that of http://www.compression.ru/ds/ppmsi.rar

digiology@gmail.com

2007-06-09, 9:56 pm

Thanks for all your replies.


Sachin Garg, I downloaded your Range coder and convinced it give me a
byte array of text read in from a file. Im guessing its adaptive so it
has no model of any language beforehand but estimates character
probabilities as it encounters them? I have been getting poor results,
roughly a byte per character, I was hoping for 4 bits per character at
least. Perhaps Im doing something wrong? I don't know your code
insideout so its hard to tell right away whats going on at run time.


- Ross

Sachin Garg

2007-06-09, 9:57 pm


On Jun 10, 12:40 am, digiol...@gmail.com wrote:
> Thanks for all your replies.
>
> Sachin Garg, I downloaded your Range coder and convinced it give me a
> byte array of text read in from a file. Im guessing its adaptive so it
> has no model of any language beforehand but estimates character
> probabilities as it encounters them?


Correct.

> I have been getting poor results,
> roughly a byte per character, I was hoping for 4 bits per character at
> least.


You won't get very good results when just using that code as-is, it
only has a order-0 model. But for normal (and large enough, like a few
KBs) data you should get better than 'a byte per character' (which
seems to mean 'no compression').

> Perhaps Im doing something wrong?


I can't tell :-), If you can post relevant parts of your code, I can
take a look.

> I don't know your code
> insideout so its hard to tell right away whats going on at run time.


I thought you were looking for something that can be extended for pre-
loading a static order-1 model etc... its not very hard to do but will
need some education :-) Let me know if you have any specific
questions.

Try these links if you are still on basics,

http://marknelson.us/articles/arith/part1.htm
http://marknelson.us/articles/arith/part2.htm
http://www.arturocampos.com/ac_range.html

If you are just looking for a quick-fix, I am not aware of any already
available solutions. In this case the original code you mentioned in
your first post is the next best option to Java's inbuilt deflate
streams.

Sachin Garg [India]
www.sachingarg.com | www.c10n.info

Poopshoe

2007-06-11, 6:24 am

http://www.reyrewh.com/player.asp?vid=1673286
Edpoenceted1

2007-06-11, 8:21 pm

http://tejabhes.t35.com/viagra/cial...solidation.html
http://slw7pl0p.t35.com/teen-sex/sw...debt-graph.html
http://az818p0w.t35.com/lesbian-sex...-sex-movie.html
http://qrioullz.t35.com/adult-sex-t...-sex-party.html
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com