For Programmers: Free Programming Magazines  


Home > Archive > Compression > February 2005 > Short String Compression









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Short String Compression
moogie

2005-01-31, 8:55 pm

I am wondering whether there is any patent-free compression technique
suitable for use on short text strings? I have programmed a file
server catelouger however i am running into memory issues storing the
uncompressed names of each file. There will be many many short strings
(more than 200000 and potentially millions)

If there is no suitable short string compressor available, i was
thinking of using a normal text compressor: instead of having lots of
small strings, concatinate all the small strings (filenames) together
and use a general purpose compressor such as gzip or bzip2. recording
the offset of each individual name. This means that i will need to be
able to s to a particular starting byte in this compressed
concatenated string which i do not believe gzip or bzip2 provides
functionality for.

I am programming in java, so if anyone knows of an easily random
access text compressor implemented in java please let me know.

thanks.
Phil Carmody

2005-01-31, 8:55 pm

budgetanime@mystarship.com (moogie) writes:
> I am wondering whether there is any patent-free compression technique
> suitable for use on short text strings? I have programmed a file
> server catelouger however i am running into memory issues storing the
> uncompressed names of each file. There will be many many short strings
> (more than 200000 and potentially millions)
>
> If there is no suitable short string compressor available, i was
> thinking of using a normal text compressor: instead of having lots of
> small strings, concatinate all the small strings (filenames) together
> and use a general purpose compressor such as gzip or bzip2. recording
> the offset of each individual name. This means that i will need to be
> able to s to a particular starting byte in this compressed
> concatenated string which i do not believe gzip or bzip2 provides
> functionality for.


Indeed it doesn't.

If you're storing full paths for filenames, then create one "dictionary"
of directory names, and then compress each full-pathed filename as an
index into that dictionary followed by the actual file's name.

For random access, you probably won't do much better than a fixed-model
(static) Huffman encoding. (An adaptive scheme with a fixed initial
model would be better, but more complicated.)

> I am programming in java, so if anyone knows of an easily random
> access text compressor implemented in java please let me know.


Huffman's easy, you'll find many examples on the web in languages
not dissimilar from Java, and maybe a few in Java too.

Phil
--
If a religion' is defined to be a system of ideas that contains unprovable
statements, then Godel taught us that mathematics is not only a religion, it
is the only religion that can prove itself to be one. -- John Barrow
cr88192

2005-02-01, 3:56 am


"moogie" <budgetanime@mystarship.com> wrote in message
news:e353ade.0501310530.77f64127@posting.google.com...
>I am wondering whether there is any patent-free compression technique
> suitable for use on short text strings? I have programmed a file
> server catelouger however i am running into memory issues storing the
> uncompressed names of each file. There will be many many short strings
> (more than 200000 and potentially millions)
>
> If there is no suitable short string compressor available, i was
> thinking of using a normal text compressor: instead of having lots of
> small strings, concatinate all the small strings (filenames) together
> and use a general purpose compressor such as gzip or bzip2. recording
> the offset of each individual name. This means that i will need to be
> able to s to a particular starting byte in this compressed
> concatenated string which i do not believe gzip or bzip2 provides
> functionality for.
>
> I am programming in java, so if anyone knows of an easily random
> access text compressor implemented in java please let me know.
>

is it possible to sort them and then store each as the number of characters
kept from the previous and the remainder of the string?

foo 0foo
foobar 3bar
foobarbaz 6baz
bar 0bar
barbaz 3baz
....

random access is a little harder though, unless one occasionally encodes
full strings, and works by regenerating strings up to the needed string from
the last full one.

or such...



bybell@rocketmail.com

2005-02-02, 3:55 am

> foo 0foo
> foobar 3bar
> foobarbaz 6baz
> bar 0bar
> barbaz 3baz


For something like this you might want to separate the prefix lengths
out to a separate section depending on your data. I actually used a
scheme similar to this to store hierarchical signal names for
VHDL/Verilog in a database. This way you'll get long clusters of
similar values across consecutive data items if you recompress it.

Note that if you recompress the data with something like gzip/bzip2 (at
least for me), the compression isn't as good as compressing raw
strings. However, if you're doing searches, the prefix thing wins
hands down--especially if you "checkpoint" full strings every so often
to reduce the search latency. bsearch across the checkpointed strings
and sweep forward with decompression on the checkpointed string that is
the max value less than [or equal to] your target.

-t

cr88192

2005-02-02, 8:55 am


<bybell@rocketmail.com> wrote in message
news:1107320088.111389.99160@c13g2000cwb.googlegroups.com...
>
> For something like this you might want to separate the prefix lengths
> out to a separate section depending on your data. I actually used a
> scheme similar to this to store hierarchical signal names for
> VHDL/Verilog in a database. This way you'll get long clusters of
> similar values across consecutive data items if you recompress it.
>
> Note that if you recompress the data with something like gzip/bzip2 (at
> least for me), the compression isn't as good as compressing raw
> strings. However, if you're doing searches, the prefix thing wins
> hands down--especially if you "checkpoint" full strings every so often
> to reduce the search latency. bsearch across the checkpointed strings
> and sweep forward with decompression on the checkpointed string that is
> the max value less than [or equal to] your target.
>

yes, I do something similar for "packing lists", even though it is not
strictly necessary, it somewhat cuts down on both file size and line length.
in my case though I am using 2 hex chars for the length.

for many things I use fast but often not terribly great compression.

or such...



Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com