Code Comments

Programming Forum and web based access to our favorite programming groups.
For Programmers: Free Programming Magazines | New: Database administration forum
Registration is free! Edit your profileCalendarFind other membersFrequently Asked QuestionsSearch -> 
Post New Thread











Thread
Author

dictionary compression
Hi,

I want to compress data like dictionaries for mobile devices.

So it should be able to decompress one entry without the need of
decompressing large blocks of compressed data, because of
slow CPU, low RAM and Java.

Length of each entry is 1 to 255 characters.

I implemented a Huffman de-/compression:

1. 'feed' the huffman tree with all single dictionary entries
2. compress each single entry separately (same huffman table)

Now each compressed entry is accessable

Loosing about 4 bit space per entry is acceptable.
(each compressed string is saved into an single array of bytes)

The compression rate is about 100 : 60-70

Now I wonder if there is a possibility to encrease compression.
- reducing redundant character sequences, algorithm ?
- BWT and MtF for each entry ?

any suggestion?

regards
Wolfgang



Report this thread to moderator Post Follow-up to this message
Old Post
Wolfgang W.
09-11-04 01:55 PM


Re: dictionary compression
"Wolfgang W." <wolfgangwendt@web.de> writes:
> Hi,

> I want to compress data like dictionaries for mobile devices.
>
> So it should be able to decompress one entry without the need of
> decompressing large blocks of compressed data, because of
> slow CPU, low RAM and Java.
>
> Length of each entry is 1 to 255 characters.
>
> I implemented a Huffman de-/compression:
>
>  1. 'feed' the huffman tree with all single dictionary entries
>  2. compress each single entry separately (same huffman table)
>
> Now each compressed entry is accessable
>
> Loosing about 4 bit space per entry is acceptable.
> (each compressed string is saved into an single array of bytes)
>
> The compression rate is about 100 : 60-70
>
> Now I wonder if there is a possibility to encrease compression.
> - reducing redundant character sequences, algorithm ?
> - BWT and MtF for each entry ?
>
> any suggestion?

Have 26 sub-dictionaries, one begining with each letter. Each word
you store will now be 1 letter shorter as you don't need to store
the first letter.

Recurse.

Bonus points for merging identical tails of words.

Phil
--
They no longer do my traditional winks tournament lunch - liver and bacon.
It's just what you need during a winks tournament lunchtime to replace lost
... liver.   -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'

Report this thread to moderator Post Follow-up to this message
Old Post
Phil Carmody
09-11-04 08:55 PM


Re: dictionary compression
"Phil Carmody" <thefatphil_demunged@yahoo.co.uk> schrieb im Newsbeitrag
news:873c1o23s5.fsf@nonospaz.fatphil.org...
> "Wolfgang W." <wolfgangwendt@web.de> writes: 
> 
>
> Have 26 sub-dictionaries, one begining with each letter. Each word
> you store will now be 1 letter shorter as you don't need to store
> the first letter.

Thx, but not practicable because of full word indexing the whole dictionary.

If the above sentence is an dictionary entry and 'bec' is searched, this
entry
should be selelected too. Not only entries like 'bec....'.

The dictionary is not and could not be sorted.
To access each entry I use an index like: [entry; word].

My idea is to remove all redundancy, stores redundant sequences uniquly into
an index table, exchange each redundant sequence by its index, build the
huffman compression table and compress each entry.

But the big question is: Does this make sence?

Because I am not an expert, I don't know how huffman is working with the
resulting large index entries and the resulting worse (or I'm wrong?) weight
distribution.

example:

the auto
the house
the automobile

would result in:

[1]
[0]house
[1]mobile

weights:
the auohsmbil[0][1]

index:
[0] -> 'the '
[1] -> [0]auto

compress:
{0}:'the '
{1}:[0]auto
[1]
[0]house
[1]mobile

index entries would be randomly accessable too.


Would be quite hard work.
Anybody knows if it could increase compression dramatically
or where I could find this algorithm explained or implemented?

regards
Wolfgang



Report this thread to moderator Post Follow-up to this message
Old Post
Wolfgang W.
09-11-04 08:55 PM


Re: dictionary compression
> Have 26 sub-dictionaries, one begining with each letter. Each word
> you store will now be 1 letter shorter as you don't need to store
> the first letter.
>
> Recurse.
>
> Bonus points for merging identical tails of words.


Hi:

There is out there in internet, a dictionary implementation made by
"Jesus Carreteros". I found it some years ago when I had to code a
dictionary for some company.

Perhaps it's a good start to see how Jesus breaks words in a very
clever way, so there is no redundancy characters.

Hope this help.

Report this thread to moderator Post Follow-up to this message
Old Post
Fernando Lagos
09-12-04 08:55 AM


Re: dictionary compression
"Wolfgang W." <wolfgangwendt@web.de> writes: 
>
> Thx, but not practicable because of full word indexing the whole dictionar
y.
>
> If the above sentence is an dictionary entry and 'bec' is searched, this
> entry
> should be selelected too. Not only entries like 'bec....'.
>
> The dictionary is not and could not be sorted.
> To access each entry I use an index like: [entry; word].

In that case, then in order to see if any string matches a word in the
dictionary as a substring you need to decompress every such word every
time. You've traded space for time, and got a massive hit in the time
department.

> My idea is to remove all redundancy, stores redundant sequences uniquly in
to
> an index table, exchange each redundant sequence by its index, build the
> huffman compression table and compress each entry.
>
> But the big question is: Does this make sence?
>
> Because I am not an expert, I don't know how huffman is working with the
> resulting large index entries and the resulting worse (or I'm wrong?) weig
ht
> distribution.
>
> example:
>
>     the auto
>     the house
>     the automobile
>
>   would result in:
>
>     [1]
>     [0]house
>     [1]mobile

And a search for 'he au' would fail, as that substring does not appear in
the two records after you've replaced the sequence 'the auto' with a single
token.

>   weights:
>     the auohsmbil[0][1]
>
>   index:
>     [0] -> 'the '
>     [1] -> [0]auto
>
>   compress:
>     {0}:'the '
>     {1}:[0]auto
>     [1]
>     [0]house
>     [1]mobile
>
>   index entries would be randomly accessable too.
>
>
> Would be quite hard work.
> Anybody knows if it could increase compression dramatically
> or where I could find this algorithm explained or implemented?

Replacement of common substrings has been tried in the past with varying
levels of success. It's quite closely related to how LZW works, but you
can work out your fixed-size dictionary in advance.

Phil
--
They no longer do my traditional winks tournament lunch - liver and bacon.
It's just what you need during a winks tournament lunchtime to replace lost
... liver.   -- Anthony Horton, 2004/08/27 at the Cambridge 'Long Vac.'

Report this thread to moderator Post Follow-up to this message
Old Post
Phil Carmody
09-12-04 08:55 AM


Re: dictionary compression
"Fernando Lagos" <lagos_fernando@hotmail.com> schrieb im Newsbeitrag
news:67198833.0409111650.c67e703@posting.google.com...
> There is out there in internet, a dictionary implementation made by
> "Jesus Carreteros". I found it some years ago when I had to code a
> dictionary for some company.
>
> Perhaps it's a good start to see how Jesus breaks words in a very
> clever way, so there is no redundancy characters.

Thanks Fernando. That would help very much.
But I can't find anything.
Do you have more information about that?

regards
Wolfgang



Report this thread to moderator Post Follow-up to this message
Old Post
Wolfgang W.
09-12-04 01:55 PM


Re: dictionary compression
> > There is out there in internet, a dictionary implementation made by 
>
> Thanks Fernando. That would help very much.
> But I can't find anything.
> Do you have more information about that?


Ups!! There is nothing now... that's strange... Perhaps I have a name
mistake. I'll try to find my old project.

Report this thread to moderator Post Follow-up to this message
Old Post
Fernando Lagos
09-13-04 08:55 PM


Re: dictionary compression
> Ups!! There is nothing now... that's strange... Perhaps I have a name
> mistake. I'll try to find my old project.

Is "Jesus Carretero". Here you got some links:

ftp.fi.upm.es/pub/unix/espa%7Enol.tar.gz
http://www.arcos.inf.uc3m.es/~jcarrete/
http://sourceforge.net/project/show...lease_id=103216

Hope this help.
Regards.

Report this thread to moderator Post Follow-up to this message
Old Post
Fernando Lagos
09-14-04 01:55 PM


Sponsored Links




Last Thread Next Thread Next
Search this forum -> 
Post New Thread

Compression archive

Show a Printable Version Send to friend Email This Page to Someone! subscribe to this thread Receive updates to this thread
Computer Consultants
Programming Jobs
Visual Basic Controls
SQL Server Programming
Webservices
Java Security
Visual Studio
C# Programming
Visual J++
Software engineering
Open source Software
Perl Programming
PHP Programming
ASP Programming
ASP .NET Programming
Visual Basic Programming
Windows Scripting Host
Java Programming
Java Help
Java Beans
VBScript
Cobol
MAC Applications
Unix Programming
Forum Jump:
All times are GMT. The time now is 05:09 PM.

 
Free MCSE Braindumps | Real Estate Topics

Programming forum archive

Copyrights CodeComments.com 2004 - 2006

Powered by vBulletin Copyright 2000-2006 Jelsoft Enterprises Limited.