Code Comments
Programming Forum and web based access to our favorite programming groups.Hi, I want to compress data like dictionaries for mobile devices. So it should be able to decompress one entry without the need of decompressing large blocks of compressed data, because of slow CPU, low RAM and Java. Length of each entry is 1 to 255 characters. I implemented a Huffman de-/compression: 1. 'feed' the huffman tree with all single dictionary entries 2. compress each single entry separately (same huffman table) Now each compressed entry is accessable Loosing about 4 bit space per entry is acceptable. (each compressed string is saved into an single array of bytes) The compression rate is about 100 : 60-70 Now I wonder if there is a possibility to encrease compression. - reducing redundant character sequences, algorithm ? - BWT and MtF for each entry ? any suggestion? regards Wolfgang
Post Follow-up to this message"Wolfgang W." <wolfgangwendt@web.de> writes: > Hi, > I want to compress data like dictionaries for mobile devices. > > So it should be able to decompress one entry without the need of > decompressing large blocks of compressed data, because of > slow CPU, low RAM and Java. > > Length of each entry is 1 to 255 characters. > > I implemented a Huffman de-/compression: > > 1. 'feed' the huffman tree with all single dictionary entries > 2. compress each single entry separately (same huffman table) > > Now each compressed entry is accessable > > Loosing about 4 bit space per entry is acceptable. > (each compressed string is saved into an single array of bytes) > > The compression rate is about 100 : 60-70 > > Now I wonder if there is a possibility to encrease compression. > - reducing redundant character sequences, algorithm ? > - BWT and MtF for each entry ? > > any suggestion? Have 26 sub-dictionaries, one begining with each letter. Each word you store will now be 1 letter shorter as you don't need to store the first letter. Recurse. Bonus points for merging identical tails of words. Phil -- They no longer do my traditional winks tournament lunch - liver and bacon. It's just what you need during a winks tournament lunchtime to replace lost ... liver. -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'
Post Follow-up to this message
"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> schrieb im Newsbeitrag
news:873c1o23s5.fsf@nonospaz.fatphil.org...
> "Wolfgang W." <wolfgangwendt@web.de> writes:
>
>
> Have 26 sub-dictionaries, one begining with each letter. Each word
> you store will now be 1 letter shorter as you don't need to store
> the first letter.
Thx, but not practicable because of full word indexing the whole dictionary.
If the above sentence is an dictionary entry and 'bec' is searched, this
entry
should be selelected too. Not only entries like 'bec....'.
The dictionary is not and could not be sorted.
To access each entry I use an index like: [entry; word].
My idea is to remove all redundancy, stores redundant sequences uniquly into
an index table, exchange each redundant sequence by its index, build the
huffman compression table and compress each entry.
But the big question is: Does this make sence?
Because I am not an expert, I don't know how huffman is working with the
resulting large index entries and the resulting worse (or I'm wrong?) weight
distribution.
example:
the auto
the house
the automobile
would result in:
[1]
[0]house
[1]mobile
weights:
the auohsmbil[0][1]
index:
[0] -> 'the '
[1] -> [0]auto
compress:
{0}:'the '
{1}:[0]auto
[1]
[0]house
[1]mobile
index entries would be randomly accessable too.
Would be quite hard work.
Anybody knows if it could increase compression dramatically
or where I could find this algorithm explained or implemented?
regards
Wolfgang
Post Follow-up to this message> Have 26 sub-dictionaries, one begining with each letter. Each word > you store will now be 1 letter shorter as you don't need to store > the first letter. > > Recurse. > > Bonus points for merging identical tails of words. Hi: There is out there in internet, a dictionary implementation made by "Jesus Carreteros". I found it some years ago when I had to code a dictionary for some company. Perhaps it's a good start to see how Jesus breaks words in a very clever way, so there is no redundancy characters. Hope this help.
Post Follow-up to this message"Wolfgang W." <wolfgangwendt@web.de> writes:
>
> Thx, but not practicable because of full word indexing the whole dictionar
y.
>
> If the above sentence is an dictionary entry and 'bec' is searched, this
> entry
> should be selelected too. Not only entries like 'bec....'.
>
> The dictionary is not and could not be sorted.
> To access each entry I use an index like: [entry; word].
In that case, then in order to see if any string matches a word in the
dictionary as a substring you need to decompress every such word every
time. You've traded space for time, and got a massive hit in the time
department.
> My idea is to remove all redundancy, stores redundant sequences uniquly in
to
> an index table, exchange each redundant sequence by its index, build the
> huffman compression table and compress each entry.
>
> But the big question is: Does this make sence?
>
> Because I am not an expert, I don't know how huffman is working with the
> resulting large index entries and the resulting worse (or I'm wrong?) weig
ht
> distribution.
>
> example:
>
> the auto
> the house
> the automobile
>
> would result in:
>
> [1]
> [0]house
> [1]mobile
And a search for 'he au' would fail, as that substring does not appear in
the two records after you've replaced the sequence 'the auto' with a single
token.
> weights:
> the auohsmbil[0][1]
>
> index:
> [0] -> 'the '
> [1] -> [0]auto
>
> compress:
> {0}:'the '
> {1}:[0]auto
> [1]
> [0]house
> [1]mobile
>
> index entries would be randomly accessable too.
>
>
> Would be quite hard work.
> Anybody knows if it could increase compression dramatically
> or where I could find this algorithm explained or implemented?
Replacement of common substrings has been tried in the past with varying
levels of success. It's quite closely related to how LZW works, but you
can work out your fixed-size dictionary in advance.
Phil
--
They no longer do my traditional winks tournament lunch - liver and bacon.
It's just what you need during a winks tournament lunchtime to replace lost
... liver. -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'
Post Follow-up to this message"Fernando Lagos" <lagos_fernando@hotmail.com> schrieb im Newsbeitrag news:67198833.0409111650.c67e703@posting.google.com... > There is out there in internet, a dictionary implementation made by > "Jesus Carreteros". I found it some years ago when I had to code a > dictionary for some company. > > Perhaps it's a good start to see how Jesus breaks words in a very > clever way, so there is no redundancy characters. Thanks Fernando. That would help very much. But I can't find anything. Do you have more information about that? regards Wolfgang
Post Follow-up to this message> > There is out there in internet, a dictionary implementation made by > > Thanks Fernando. That would help very much. > But I can't find anything. > Do you have more information about that? Ups!! There is nothing now... that's strange... Perhaps I have a name mistake. I'll try to find my old project.
Post Follow-up to this message> Ups!! There is nothing now... that's strange... Perhaps I have a name > mistake. I'll try to find my old project. Is "Jesus Carretero". Here you got some links: ftp.fi.upm.es/pub/unix/espa%7Enol.tar.gz http://www.arcos.inf.uc3m.es/~jcarrete/ http://sourceforge.net/project/show...lease_id=103216 Hope this help. Regards.
Post Follow-up to this message
Show a Printable Version
Email This Page to Someone!
Receive updates to this thread
Powered by vBulletin
Copyright 2000-2006 Jelsoft Enterprises Limited.