For Programmers: Free Programming Magazines  


Home > Archive > Compression > February 2005 > Short String Compression









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Short String Compression
moogie

2005-02-01, 8:55 am

I am only storing the filename itself, the path of the file not stored
and is constructed from the location of the file in a directory tree.
I am saving considerable memory doing this.

At the moment i have settled on a static Huffman compression of the
bytes making up the unicode string storing a filename. It does reduce
the memory needed, however it significantly increases the time taken
to perform a search. I have had to make a comprimise by only
compressing filenames which achieve a threashold compression of bytes.
This compromise seems to be acceptable as it halved the search time
(As compared to all filenames compressed) and only moderately
increased memory usage.

Unless someone knows of an easily impelmented short string compressor
i guess this will do.
Tim Arheit

2005-02-01, 3:55 pm

On 1 Feb 2005 02:37:28 -0800, budgetanime@mystarship.com (moogie)
wrote:
>Unless someone knows of an easily impelmented short string compressor
>i guess this will do.


If the filenames are simple have some pattern in them. (ie. english
words), and have unused characters (ie. filenames may be 7-bit), then
using a simple dictionary lookup for the most common 2 or 3 byte
substrings and storing them using the unused 8-bit characters may
help.

As for searching. If you are looking for exact matches only, sort
your database based on the compressed filenames (doesn't matter by
which compression). Then when you do a search, compress the string
you are searching for, then look for the compressed filename rather
than decompressing each filename examined while searching.

-Tim
bybell@rocketmail.com

2005-02-02, 3:55 am

Considering he's using static huffman, he probably could hash the data
across n number of buckets and have n pointers into where the data
starts. (i.e., each bucket is simply a section of static huffman
encoded data whose filename hashes to the value for that section.)

If n isn't ridiculously large compared to how many data items there
are, it should speed up his search times without much impact on his
storage size.

-t

Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com