Home > Archive > Compression > June 2007 > compressing short XML messages without including dictionary or huffman table
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
compressing short XML messages without including dictionary or huffman table
|
|
| benedict 2007-06-01, 7:55 am |
|
I have to compress some XML messages, each only 300-600 chars long.
There is a very low upper bound on the size of the compressed
messages, so I cannot afford to transmit any dictionary or Huffman
encoding table along with each message, but must 'hardwire' them into
the receiver.
1. I am looking for advice on a general approach to compressing short
messages, to be transmitted without dictionary/table overhead.
2. I'd ideally like to find a Java implementation of an appropriate
algorithm.
3. I have already investigated JZLib at some length. For some
messages the performance is acceptable, but I find inexplicable
oscillations in compressed sizes. e.g. a message of 160 chars
compresses to 143 bytes, appending one char compresses 161 chars to
48 bytes and adding a second char compresses 162 chars to 144 bytes!
I don't understand the sudden large steps up and down in compressed
sizes.
I'd be really grateful for any help/advice.
Benedict
| |
| collection60@googlemail.com 2007-06-01, 6:55 pm |
| > 1. I am looking for advice on a general approach to compressing short
> messages, to be transmitted without dictionary/table overhead.
I'm no expert, by far... but just a guess... perhaps use the deflate
algorithm, on it's -9 setting. It's a good start.
If your XML was of a specific vocabulary, let's say it was svg files
you were transmitting, you could refer to a specific svg dictionary
for compression. :) Of course, if your XML files are of many different
vocabs.... I don't think this will help you.
> I don't understand the sudden large steps up and down in compressed
> sizes.
An inefficient algorithm, simply. The code to detect patterns is sub
optimal. It loses patterns and misses them due to limitations of the
code.
| |
| Mark Adler 2007-06-01, 6:55 pm |
| On Jun 1, 5:37 am, benedict <benedicth...@googlemail.com> wrote:
> but I find inexplicable
> oscillations in compressed sizes. e.g. a message of 160 chars
> compresses to 143 bytes, appending one char compresses 161 chars to
> 48 bytes and adding a second char compresses 162 chars to 144 bytes!
> I don't understand the sudden large steps up and down in compressed
> sizes.
Neither do I! I'd like to try to understand that -- it smells like a
bug. Can you send me those examples? Thanks.
Mark Adler
| |
| Mark Adler 2007-06-01, 6:55 pm |
| On Jun 1, 5:37 am, benedict <benedicth...@googlemail.com> wrote:
> 1. I am looking for advice on a general approach to compressing short
> messages, to be transmitted without dictionary/table overhead.
First off, you should look at XML-WRT, which preprocesses XML for
better compression:
http://xml-wrt.sourceforge.net/
Second, you should consider sending the short messages in a sequence
as part of a larger zlib stream. That way later messages can match
strings from older messages to improve the compression. zlib provides
sync markers so you can break up the stream. Also the uncompressed
data can have markers in it. Every once in a while you would reset
the history and start over -- every hundred messages or so would have
little impact on the compression gain.
Mark Adler
| |
| benedict 2007-06-02, 7:56 am |
| On Jun 1, 6:51 pm, Mark Adler <mad...@alumni.caltech.edu> wrote:
> On Jun 1, 5:37 am,benedict<benedicth...@googlemail.com> wrote:
>
>
> First off, you should look at XML-WRT, which preprocesses XML for
> better compression:
>
> http://xml-wrt.sourceforge.net
Many thanks for this pointer - extremely useful & relevant.
>
> Second, you should consider sending the short messages in a sequence
> as part of a larger zlib stream. That way later messages can match
> strings from older messages to improve the compression. zlib provides
> sync markers so you can break up the stream. Also the uncompressed
> data can have markers in it. Every once in a while you would reset
> the history and start over -- every hundred messages or so would have
> little impact on the compression gain.
That sounds interesting, but ly I'm very constrained in the size of
the zlib stream - it has to fit in a couple of SMS messages, so each
message really has to be sent independently. That's why I hoped that
hard-wiring in the receiver any dictionary or huffman table might be
the way forward. ZLib provides a setDictionary(), but doesn't define
the format, and I can't see how to avoid transmitting the Huffman tree
with every message.
| |
| Mark Adler 2007-06-02, 6:55 pm |
| On Jun 2, 5:31 am, benedict <benedicth...@googlemail.com> wrote:
> That sounds interesting, but ly I'm very constrained in the size of
> the zlib stream - it has to fit in a couple of SMS messages, so each
> message really has to be sent independently.
You can have the same effect by sending a series of independent zlib
streams that each need a dictionary. Simply build a sliding
dictionary of the last 32K worth of previous messages in the order
sent. Then that dictionary can be reconstructed at the other end to
decompress each zlib stream in sequence.
> That's why I hoped that
> hard-wiring in the receiver any dictionary or huffman table might be
> the way forward. ZLib provides a setDictionary(), but doesn't define
> the format,
That's because there is no format. The dictionary at any point in
time is simply the 32K bytes preceding the byte currently being
compressed. Providing a dictionary then can benefit the compression
of the first 32K (and only the first 32K). To make a "good"
dictionary, there simply needs to be a good likelihood of strings in
the message matching strings in the dictionary, and of course it has
be to reconstructable at the other end, or already there. So the
"format" is just 32K bytes of data that you have some reason to
believe is correlated to the data you plan to compress.
Mark
| |
|
|
| benedict 2007-06-04, 6:55 pm |
|
>
> That's because there is no format. The dictionary at any point in
> time is simply the 32K bytes preceding the byte currently being
> compressed. Providing a dictionary then can benefit the compression
> of the first 32K (and only the first 32K). To make a "good"
> dictionary, there simply needs to be a good likelihood of strings in
> the message matching strings in the dictionary, and of course it has
> be to reconstructable at the other end, or already there. So the
> "format" is just 32K bytes of data that you have some reason to
> believe is correlated to the data you plan to compress.
>
Thank you - I beginning to understand! I'm now using a dictionary,
but am a bit surprised at the results.
If I use the whole message as the dictionary (!), I do get extremely
good compression.
In lots of other cases, using a dictionary actually increases the
output size.
For example, build a string by replicating 45 times 'abcxyz' . Then
compress it using various dictionaries.
All dictionaries except 'abcxyz' actually caused the compressed output
to be longer than not using any dictionary.
I thought that adding a dictionary should not cause any increase in
output size - at best I thought it would be neutral.
input input length, output bytes, output/input as percent,
dictionary string
abcxyz*45 270 => 19 ( 7.0%) :: i.e. no dictionary
abcxyz*45 270 => 22 ( 8.1%) :abc: - why should this
increase the output size?
abcxyz*45 270 => 22 ( 8.1%) :xyz:
abcxyz*45 270 => 19 ( 7.0%) :abcxyz:
abcxyz*45 270 => 20 ( 7.4%) :xyzabc: - suprisingly not
quite the same uas using 'abcxyz'
abcxyz*45 270 => 21 ( 7.8%) :abc xyz: - slightly worse
than 'abcxyz' without the space
So, why is it that specifiying a dictionary can actually increase
output size?
Many thanks for your help,
Benedict
| |
| Mark Adler 2007-06-04, 6:55 pm |
| On Jun 4, 8:54 am, benedict <benedicth...@googlemail.com> wrote:
> abcxyz*45 270 => 19 ( 7.0%) :: i.e. no dictionary
> abcxyz*45 270 => 22 ( 8.1%) :abc: - why should this
> increase the output size?
> abcxyz*45 270 => 22 ( 8.1%) :xyz:
> abcxyz*45 270 => 19 ( 7.0%) :abcxyz:
> abcxyz*45 270 => 20 ( 7.4%) :xyzabc: - suprisingly not
> quite the same uas using 'abcxyz'
> abcxyz*45 270 => 21 ( 7.8%) :abc xyz: - slightly worse
> than 'abcxyz' without the space
>
> So, why is it that specifiying a dictionary can actually increase
> output size?
You are looking at extremely small variations in the output size (one
or two bytes), which is not statistically meaningful. In addition,
you are starting with an example that is already highly compressible,
so further gain is rather limited. You should try more realistic
examples, e.g. the XML messages that started the discussion.
Mark
| |
| cr88192 2007-06-05, 9:55 pm |
|
"benedict" <benedictheal@googlemail.com> wrote in message
news:1180701466.372214.8720@h2g2000hsg.googlegroups.com...
>
> I have to compress some XML messages, each only 300-600 chars long.
> There is a very low upper bound on the size of the compressed
> messages, so I cannot afford to transmit any dictionary or Huffman
> encoding table along with each message, but must 'hardwire' them into
> the receiver.
>
> 1. I am looking for advice on a general approach to compressing short
> messages, to be transmitted without dictionary/table overhead.
>
> 2. I'd ideally like to find a Java implementation of an appropriate
> algorithm.
>
> 3. I have already investigated JZLib at some length. For some
> messages the performance is acceptable, but I find inexplicable
> oscillations in compressed sizes. e.g. a message of 160 chars
> compresses to 143 bytes, appending one char compresses 161 chars to
> 48 bytes and adding a second char compresses 162 chars to 144 bytes!
> I don't understand the sudden large steps up and down in compressed
> sizes.
>
>
>
> I'd be really grateful for any help/advice.
>
dunno if relevant/interesting, but before I had experimented with customized
serializers/deserializers for XML (and in other cases, for other kinds of
structured data, such as s-expressions).
the basic idea is that we can view the XML, not as text, but as a
tree-structured stream of tags and data. so, we can build a dictionary of
tags (a new tag is encoded literally, if new, but is otherwise just reused).
given the high frequency of similararily between different uses of a tag, we
can start merging other pieces of info (such whether or not attributes or
contents were present, particular attribute tags, ...) into the tag itself
(using more dictionary entries, but saving bytes).
likewise, pieces of the tree can also be stored temporarily, and reused as
needed (for example, in a kind of rotator).
additionally, a kind of LZW or LZP based algo can be used when encoding
strings.
sometimes, rather than using a purely declarative serialization, I have
based the serialization on a kind of tweaky and specialized stack machine.
with some tweaking, all this can do fairly well, and can be implemented
absent resorting to huffman or a traditional compressor.
(but in the end it all wasn't all that useful).
or something...
> Benedict
>
| |
| danilobrambilla@tiscali.it 2007-06-06, 9:56 pm |
| On 1 Giu, 14:37, benedict <benedicth...@googlemail.com> wrote:
> I have to compress some XML messages, each only 300-600 chars long.
> There is a very low upper bound on the size of the compressed
> messages, so I cannot afford to transmit any dictionary or Huffman
> encoding table along with each message, but must 'hardwire' them into
> the receiver.
>
> 1. I am looking for advice on a general approach to compressing short
> messages, to be transmitted without dictionary/table overhead.
>
> 2. I'd ideally like to find a Java implementation of an appropriate
> algorithm.
>
> 3. I have already investigated JZLib at some length. For some
> messages the performance is acceptable, but I find inexplicable
> oscillations in compressed sizes. e.g. a message of 160 chars
> compresses to 143 bytes, appending one char compresses 161 chars to
> 48 bytes and adding a second char compresses 162 chars to 144 bytes!
> I don't understand the sudden large steps up and down in compressed
> sizes.
>
> I'd be really grateful for any help/advice.
>
> Benedict
Hi
you can try dinamic Huffman compression using Vitter Algorithm. No
need to transmit the dictionary because build on receiver as receivig
data. You can also implement it using 2 arrays for speed up things
instead of trees.
http://en.wikipedia.org/wiki/Adaptive_Huffman_coding
Best Regards, Danilo
| |
| benedict 2007-06-06, 9:56 pm |
| > You are looking at extremely small variations in the output size (one
> or two bytes), which is not statistically meaningful. In addition,
> you are starting with an example that is already highly compressible,
> so further gain is rather limited. You should try more realistic
> examples, e.g. the XML messages that started the discussion.
Thanks for all the comments. I'm sorry if my artificial example
proved a red-herring. I was just puzzled that it should ever happen
that specifying a dictionary should actually increase the output size.
If I compress my real XML (838 chars of which 230 are xml tags and 608
text in attributes) using various different dictionaries, I get:
a) using the message itself as dictionary (!) just to show it works
838 => 23 ( 2.7%)
b) using no dictionary at all
838 => 401 ( 47.9%)
c) using as dictionary the XML tags from the message, got by
stripping out the data content from all attributes
i.e. <f i=""h=""><ch i=""l=""vs=''ty=""/><ch i=""l=""vs=''ty=""/><ch
i="" .......
I cannot use English text in the dictionary, as the system is multi-
lingual.
838 => 390 ( 46.5%)
1. Should I be surprised that I cannot do better than 47.9% with no
dictionary? The literature shows results of 19-27% for running text
(doubtless much longer than mine). I would have expected even a simple
Huffman tree to do better than this, but that's from a position of
totally ignorance.
2. I'm surprised that using as dictionary all the XML tags gives only
another 1% gain. As 1/4 of the text is XML, am I misguided in
expecting more gain?
Benedict
| |
| benedict 2007-06-06, 9:56 pm |
|
> you can try dinamic Huffman compression using Vitter Algorithm. No
> need to transmit the dictionary because build on receiver as receivig
> data. You can also implement it using 2 arrays for speed up things
> instead of trees.
>
> http://en.wikipedia.org/wiki/Adaptive_Huffman_coding
Many thanks for this lead, which I didn't know about. I'll certainly
give it a try.
benedict
| |
| Mark Adler 2007-06-06, 9:56 pm |
| On Jun 6, 11:18 am, benedict <benedicth...@googlemail.com> wrote:
> c) using as dictionary the XML tags from the message, got by
> stripping out the data content from all attributes
> i.e. <f i=""h=""><ch i=""l=""vs=''ty=""/><ch i=""l=""vs=''ty=""/><ch
> i="" .......
>
> 1. Should I be surprised that I cannot do better than 47.9% with no
> dictionary?
I'm surprised you got that much for a < 1K message. To see how well
you can do with a lot of data, simply concatenate a lot of these
messages (different from each other) and compare the compression of
that with the sum of the compressions of the individual messages.
> 2. I'm surprised that using as dictionary all the XML tags gives only
> another 1% gain. As 1/4 of the text is XML, am I misguided in
> expecting more gain?
>From the tag examples, I wouldn't expect much gain. The parts of the
tags that can be matched are very short, e.g. '<f i="' or '"/>'.
Effectively the tags are already compressed, since they have one or
two character names. For short strings, the difference between coding
the string as Huffman literals (which is what you're getting without
the dictionary for the first occurrences) and coding the string as an
offset and length to a previous string is not very large.
Furthermore, if those tags appear in the message more than once,
you'll only get gain from the first appearance. The second appearance
gets matched to the first one, whether or not you provide a
dictionary.
Mark
| |
| collection60@googlemail.com 2007-06-16, 3:56 am |
| I had the idea of making a compression scheme which could encode 127
symbols as one byte. Basically, it used a low byte (0-127) to take a
symbol from a 127 long array of symbols.
The high bytes had other special abilities.
Now if I just get some time to make the algorithm ;)
I wonder if this technique of mine would help in this case where there
are a lot of short strings to compress?
Mine can acheive a maximum theoretical compression of approaching 50%
for the case where we have only patterns that are 2 bytes long, but
are randomly arranged.
Of course the reality of my algorithm when I get around to completing
it, may leave much to be desired :)
Any guesses on if such an algorithm can help compress? For what it's
worth... the overhead for adding a new symbol of less than 8 bytes, to
a dictionary is 1 byte.
I have 7 special "escape and add to the next free slot in dictionary"
codes :) One for each length from 2 to 8 bytes.
| |
| collection60@googlemail.com 2007-06-16, 3:56 am |
| On Jun 15, 10:48 pm, Josiah Carlson <josiah.carl...@sbcglobal.net>
wrote:
> collectio...@googlemail.com wrote:
>
> Implement it. Until then, it's just navel gazing.
>
> - Josiah
I have implemented the decoder ;) Just not the encoder, which will be
many times harder than the decoder of course.
And it's not navel gazing. It's exactly what I said before. I've been
out of time. But thanks for the statements about someone you know
nothing about.
|
|
|
|
|