For Programmers: Free Programming Magazines  


Home > Archive > Compression > June 2007 > Compressing List of Tuples









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author Compressing List of Tuples
bullockbefriending bard

2007-05-28, 7:56 am

Perhaps this is an unusual problem. I am hoping that it is not, and
that someone who frequents this list can make some suggestions.

I have a requirement to compress large lists of 6 integer element
tuples which look like this:

1,5,11,1,2,2
1,5,11,1,2,9
1,5,11,1,4,2
1,5,11,1,4,5
1,5,11,1,4,9
1,5,11,1,5,1

Individual field in the tuples have values between 1 and 14. A typical
list might contain 50,000+ unique tuples with no list ever containing
duplicate tuples.

The catch is that I *must* use a specific (see below) compression
format. The real nature of my problem is trying to maximise the speed
and compression factor.

Compressing the above list using the required format gives:

1,5,11,1,2+4,2 can be expanded to recover original tuples
1,5,11,1,2,2 and 1,5,11,1,4,2
1,5,11,1,2+4,9 can be expanded to recover original tuples
1,5,11,1,2,9 and 1,5,11,1,4,9
1,5,11,1,4,5 could not be 'merged' with another tuple in the
input list
1,5,11,1,5,1 could not be 'merged with another tuple in the
input list

compressed tuples may contain multiple terms in each of the 6
postions. e.g.

1+2,1+2,1+2,1+2,1+2,1+2 is a compressed representation of 64
individual tuples:

1,1,1,1,1,1
1,1,1,1,1,2
1,1,1,1,2,1
1,1,1,1,2,2
..
..
2,2,2,2,2,2

it does not make sense to repeat an integer within an individual field
of a tuple: 1+2+2,1,1,1,1,1 is illegal. 1+2+14,1+2,1,1,1,1+14 is
legal.

the overriding objective is to reduce the number of *lines* in a file
of input tuples as much as possible.

anyway, it is trivial, given a list of compressed tuples, to expand
them to get the original list of tuples with no multiple terms in each
field.

what is not so trivial, is to determine an optimal list of compressed
tuples which exactly covers the original list of uncompressed tuples.

i am aware that this is analogous to the set cover problem, and also
aware of a graph matching approach which looks good on the back of an
envelope, but which is perhaps 'too hard' in real life to solve.
Regarding a set cover approach, there is also a problem in that the
set of legal compressed tuples is HUGE (consider how many possible
ways there are to create compressed tuples with between one and
fourteen elements in each of six fields:)). anyway, NP-hard either
way... so unless i'm mistaken, we're better off looking at heuristics.

i have a naive greedy solution which achieves (on average) 10x
reduction in the number of lines of tuples. however, i believe that
one should be able to do better than that.

i'd be very interested to hear any new ideas - especially ideas about
how i might prune the solution space.

PS: i should reiterate - the compression format i use is non-
negotiable because the black box system which consumes the end-result
data can accept only either uncompressed tuples or tuples compressed
using the format i described above.

Dajo21

2007-06-07, 3:39 pm

HaCkeD aDuLT SiTe :)
Direct access to member zone
http://uniqueadult.com/members/video.php?file=1
username: 218571
password: wanttocome
change the number in the link to get other videos! There are gigs of them!
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com