| bullockbefriending bard 2007-05-28, 7:56 am |
| Perhaps this is an unusual problem. I am hoping that it is not, and
that someone who frequents this list can make some suggestions.
I have a requirement to compress large lists of 6 integer element
tuples which look like this:
1,5,11,1,2,2
1,5,11,1,2,9
1,5,11,1,4,2
1,5,11,1,4,5
1,5,11,1,4,9
1,5,11,1,5,1
Individual field in the tuples have values between 1 and 14. A typical
list might contain 50,000+ unique tuples with no list ever containing
duplicate tuples.
The catch is that I *must* use a specific (see below) compression
format. The real nature of my problem is trying to maximise the speed
and compression factor.
Compressing the above list using the required format gives:
1,5,11,1,2+4,2 can be expanded to recover original tuples
1,5,11,1,2,2 and 1,5,11,1,4,2
1,5,11,1,2+4,9 can be expanded to recover original tuples
1,5,11,1,2,9 and 1,5,11,1,4,9
1,5,11,1,4,5 could not be 'merged' with another tuple in the
input list
1,5,11,1,5,1 could not be 'merged with another tuple in the
input list
compressed tuples may contain multiple terms in each of the 6
postions. e.g.
1+2,1+2,1+2,1+2,1+2,1+2 is a compressed representation of 64
individual tuples:
1,1,1,1,1,1
1,1,1,1,1,2
1,1,1,1,2,1
1,1,1,1,2,2
..
..
2,2,2,2,2,2
it does not make sense to repeat an integer within an individual field
of a tuple: 1+2+2,1,1,1,1,1 is illegal. 1+2+14,1+2,1,1,1,1+14 is
legal.
the overriding objective is to reduce the number of *lines* in a file
of input tuples as much as possible.
anyway, it is trivial, given a list of compressed tuples, to expand
them to get the original list of tuples with no multiple terms in each
field.
what is not so trivial, is to determine an optimal list of compressed
tuples which exactly covers the original list of uncompressed tuples.
i am aware that this is analogous to the set cover problem, and also
aware of a graph matching approach which looks good on the back of an
envelope, but which is perhaps 'too hard' in real life to solve.
Regarding a set cover approach, there is also a problem in that the
set of legal compressed tuples is HUGE (consider how many possible
ways there are to create compressed tuples with between one and
fourteen elements in each of six fields:)). anyway, NP-hard either
way... so unless i'm mistaken, we're better off looking at heuristics.
i have a naive greedy solution which achieves (on average) 10x
reduction in the number of lines of tuples. however, i believe that
one should be able to do better than that.
i'd be very interested to hear any new ideas - especially ideas about
how i might prune the solution space.
PS: i should reiterate - the compression format i use is non-
negotiable because the black box system which consumes the end-result
data can accept only either uncompressed tuples or tuples compressed
using the format i described above.
|