Home > Archive > Compression > May 2007 > Compressing List of Tuples
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Compressing List of Tuples
|
|
| bullockbefriending bard 2007-05-28, 7:56 am |
| Perhaps this is an unusual problem. I am hoping that it is not, and
that someone who frequents this list can make some suggestions.
I have a requirement to compress large lists of 6 integer element
tuples which look like this:
1,5,11,1,2,2
1,5,11,1,2,9
1,5,11,1,4,2
1,5,11,1,4,5
1,5,11,1,4,9
1,5,11,1,5,1
Individual field in the tuples have values between 1 and 14. A typical
list might contain 50,000+ unique tuples with no list ever containing
duplicate tuples.
The catch is that I *must* use a specific (see below) compression
format. The real nature of my problem is trying to maximise the speed
and compression factor.
Compressing the above list using the required format gives:
1,5,11,1,2+4,2 can be expanded to recover original tuples
1,5,11,1,2,2 and 1,5,11,1,4,2
1,5,11,1,2+4,9 can be expanded to recover original tuples
1,5,11,1,2,9 and 1,5,11,1,4,9
1,5,11,1,4,5 could not be 'merged' with another tuple in the
input list
1,5,11,1,5,1 could not be 'merged with another tuple in the
input list
compressed tuples may contain multiple terms in each of the 6
postions. e.g.
1+2,1+2,1+2,1+2,1+2,1+2 is a compressed representation of 64
individual tuples:
1,1,1,1,1,1
1,1,1,1,1,2
1,1,1,1,2,1
1,1,1,1,2,2
..
..
2,2,2,2,2,2
it does not make sense to repeat an integer within an individual field
of a tuple: 1+2+2,1,1,1,1,1 is illegal. 1+2+14,1+2,1,1,1,1+14 is
legal.
the overriding objective is to reduce the number of *lines* in a file
of input tuples as much as possible.
anyway, it is trivial, given a list of compressed tuples, to expand
them to get the original list of tuples with no multiple terms in each
field.
what is not so trivial, is to determine an optimal list of compressed
tuples which exactly covers the original list of uncompressed tuples.
i am aware that this is analogous to the set cover problem, and also
aware of a graph matching approach which looks good on the back of an
envelope, but which is perhaps 'too hard' in real life to solve.
Regarding a set cover approach, there is also a problem in that the
set of legal compressed tuples is HUGE (consider how many possible
ways there are to create compressed tuples with between one and
fourteen elements in each of six fields:)). anyway, NP-hard either
way... so unless i'm mistaken, we're better off looking at heuristics.
i have a naive greedy solution which achieves (on average) 10x
reduction in the number of lines of tuples. however, i believe that
one should be able to do better than that.
i'd be very interested to hear any new ideas - especially ideas about
how i might prune the solution space.
PS: i should reiterate - the compression format i use is non-
negotiable because the black box system which consumes the end-result
data can accept only either uncompressed tuples or tuples compressed
using the format i described above.
| |
| Josiah Carlson 2007-05-28, 9:55 pm |
| bullockbefriending bard wrote:
> Perhaps this is an unusual problem. I am hoping that it is not, and
> that someone who frequents this list can make some suggestions.
>
> I have a requirement to compress large lists of 6 integer element
> tuples which look like this:
>
> 1,5,11,1,2,2
> 1,5,11,1,2,9
> 1,5,11,1,4,2
> 1,5,11,1,4,5
> 1,5,11,1,4,9
> 1,5,11,1,5,1
>
> Individual field in the tuples have values between 1 and 14. A typical
> list might contain 50,000+ unique tuples with no list ever containing
> duplicate tuples.
A *very* similar question to this was posted a couple w s ago...
http://groups.google.com/group/comp...7ab288ec2f50f0a
The algorithm I provided in that thread relies on certain features of
the permutations of 3 elements to merge groupings of 3-tuples. If you
were to iterate over the 720 permutations of your 6-tuples, using the
(very simple) matching algorithm I provided, you may be able to get
fairly decent results.
- Josiah
| |
| bullockbefriending bard 2007-05-29, 3:56 am |
| excellent thread! thanks for the pointer. to reiterate what somebody
said in that thread, usenet is still a Good Thing.
On May 29, 4:51 am, Josiah Carlson <josiah.carl...@sbcglobal.net>
wrote:
> bullockbefriending bard wrote:
>
>
>
>
> A *very* similar question to this was posted a couple w s ago...
>
> http://groups.google.com/group/comp...read/thread/...
>
> The algorithm I provided in that thread relies on certain features of
> the permutations of 3 elements to merge groupings of 3-tuples. If you
> were to iterate over the 720 permutations of your 6-tuples, using the
> (very simple) matching algorithm I provided, you may be able to get
> fairly decent results.
>
> - Josiah
| |
| bullockbefriending bard 2007-05-29, 7:56 am |
| PS: apropos of nothing except proof that there's a lot less than 6
degrees of freedom once you start focusing on a problem domain and
googling - i'd already bookmarked your supervisor's page as one of
several people most likely in the multipartite matching department +
had filed away your ASPN python union-find submission.
> On May 29, 4:51 am, Josiah Carlson <josiah.carl...@sbcglobal.net> wrote:
> bullockbefriending bard wrote:
>
>
>
>
> A *very* similar question to this was posted a couple w s ago...
>
> http://groups.google.com/group/comp...read/thread/...
>
> The algorithm I provided in that thread relies on certain features of
> the permutations of 3 elements to merge groupings of 3-tuples. If you
> were to iterate over the 720 permutations of your 6-tuples, using the
> (very simple) matching algorithm I provided, you may be able to get
> fairly decent results.
>
> - Josiah
| |
| bullockbefriending bard 2007-05-30, 9:56 pm |
| Once again thanks for pointing me towards this very useful thread.
I've taken the trouble to modify your 3-tuple example to more closely
suit my 6-tuple problem.
You will see that my modified method (perhaps naively) only performs
one rotation through the 6 columns as it does the greedy merge. I'm
afraid I don't really understand your reference to permutations. Well,
I *do* know that there are 6 ways to order a the elements of a 3-tuple
and 720 ways to order the elements of a 6-tuple... but I'm not really
clear on why this should be relevant given that the innermost loop in
your example compares 5 elements of two consecutive tuples in order to
see if a merge condition exists in element 6 - I just can't see why
one would need to care about order of taking these?
There must be something I'm just not getting here. My modified method
achieves 'reasonable' (>20x) compression on a file with 140K+ tuples
(needless to say, these are not random and there is obviously some
hint of underlying structure). However, assuming I *have* somehow
missed part of the point of your algorithm, I'd really appreciate it
if you could point it out to me, as I certainly wouldn't mind
sacrificing longer run time for greater compression.
(In addition, as far as I can tell, your 3-tuple code and my above
method in 3-tuple form, perform identically on input data sets... )
Many thanks,
Peter
Following are your version from the thread you pointed me to + what
seems to work ~OK for me.
def foo(lolol):
#lolol for a list of lists of lists
for i in xrange(2):
for j in xrange(3):
lolol.sort()
k = 1
while k < len(lolol):
if lolol[k-1][:2] == lolol[k][:2]:
lolol[k-1][2].extend(lolol.pop(k)[2])
lolol[k-1][2].sort()
else:
k += 1
#rotate!
for k in lolol:
k.append(k.pop(0))
#reverse!
for k in lolol:
k.reverse()
return lolol
def combination_compression_algorithm_(combi
nations):
"""Greedily compress all combinations on one field, then next field,
etc.
Parameter combinations is structured so:
[
[[1],[2],[3],[4],[5],[6]],
[[1],[2],[3],[4],[5],[7]],
]
Given the above input, returned combinations should be:
[
[[1],[2],[3],[4],[5],[6,7]]
]
"""
for do_it_six_times in xrange(6):
# we want everthing to be in lexically ascending order of
# combinations: [[1],[2],[3],[4],[5],[6]] < [[1],[2],[3],[4],[5],
[7]]
# < [[1],[2],[3],[5],[14],[14]]
combinations.sort()
k = 1
while k < len(combinations):
# compare fields 0-4 of two consecutive combinations, e.g.
# [[1],[2],[3],[4],[5],[6]], [[1],[2],[3],[4],[5],[7]]
if combinations[k-1][:5] == combinations[k][:5]:
# [[1],[2],[3],[4],[5]] == [[1],[2],[3],[4],[5]] =>
# we append the elements of combinations[k][5]
# to the equivalent sublist in the preceding combination, we
# can we can get rid of combinations[k] and
# combinations[k-1] becomes [[1],[2],[3],[4],[5],[6,7]]
# surprisingly combinations.pop(k) doesn't seem to result
# in slow run times for lists of 100K+ combinations.
combinations[k-1][5].extend(combinations.pop(k)[5])
# following line seems not necessary since top of outer loop
# combinations.sort() puts everything into lexically ascending
# order such that above line will always maintain correct
# element order in combinations[k-1][5]
#combinations[k-1][5].sort()
else:
# greedy algorithm. we keep merging into combinations[k-1][5]
# until we run out of matches. only then do we increment k.
k += 1
# rotate combinations left so that (e.g.)
# combination [[1],[2],[3],[4],[5],[6]] => [[2],[3],[4],[5],[6],[1]]
# the outer loop has us do this 6 times, so that upon termination
# of outer loop, applied the combination[5] merge of the inner loop
# to each of the 6 elements of combination[] and each combination is
# returned to original starting order of elements.
for combi in combinations:
combi.append(combi.pop(0))
return combinations
> A *very* similar question to this was posted a couple w s ago...
>
> http://groups.google.com/group/comp...read/thread/...
>
> The algorithm I provided in that thread relies on certain features of
> the permutations of 3 elements to merge groupings of 3-tuples. If you
> were to iterate over the 720 permutations of your 6-tuples, using the
> (very simple) matching algorithm I provided, you may be able to get
> fairly decent results.
>
> - Josiah
| |
| Josiah Carlson 2007-05-30, 9:56 pm |
| bullockbefriending bard wrote:
> Once again thanks for pointing me towards this very useful thread.
> I've taken the trouble to modify your 3-tuple example to more closely
> suit my 6-tuple problem.
>
> You will see that my modified method (perhaps naively) only performs
> one rotation through the 6 columns as it does the greedy merge. I'm
> afraid I don't really understand your reference to permutations. Well,
> I *do* know that there are 6 ways to order a the elements of a 3-tuple
> and 720 ways to order the elements of a 6-tuple... but I'm not really
> clear on why this should be relevant given that the innermost loop in
> your example compares 5 elements of two consecutive tuples in order to
> see if a merge condition exists in element 6 - I just can't see why
> one would need to care about order of taking these?
>
> There must be something I'm just not getting here. My modified method
> achieves 'reasonable' (>20x) compression on a file with 140K+ tuples
> (needless to say, these are not random and there is obviously some
> hint of underlying structure). However, assuming I *have* somehow
> missed part of the point of your algorithm, I'd really appreciate it
> if you could point it out to me, as I certainly wouldn't mind
> sacrificing longer run time for greater compression.
I could have sworn that permuting the elements in all possible ways
would allow for more potential matching, but through a bit of testing
myself, it doesn't seem to make any difference.
What *does* make a difference is how the columns are permuted initially,
because as you choose a particular matching, you remove a different set
of choices for the matching. In a quick test among 262,144 6-tuples of
values randomly from 0-15, it was able to reduce it down to 180,475
elements, but randomly permuting the columns and doing it again produced
180,418 elements. Another run got me 180,475 and 180,489 elements.
So, if you have some time, you can permute the columns in all possible
ways, running each one of them through the algorithm, then
reverse-permute the best result. Or you can randomly permute the
columns a few times, running each through the algorithm, and pick the
best one.
- Josiah
| |
| bullockbefriending bard 2007-05-31, 7:55 am |
| > I could have sworn that permuting the elements in all possible ways
> would allow for more potential matching, but through a bit of testing
> myself, it doesn't seem to make any difference.
glad to hear you find the same thing. i don't have a CS background and
basically don't even know what i don't know, so your confirmation is a
relief. one day i must stop using the copy of cormen i bought from
amazon as a doorstopper and start at page 1.
> What *does* make a difference is how the columns are permuted initially,
> because as you choose a particular matching, you remove a different set
> of choices for the matching. In a quick test among 262,144 6-tuples of
> values randomly from 0-15, it was able to reduce it down to 180,475
> elements, but randomly permuting the columns and doing it again produced
> 180,418 elements. Another run got me 180,475 and 180,489 elements.
quite so. based on some earlier suggestions i had constructed a
modified hamming graph linking tuples which differed in only one field
and with the aid of graphviz 'discovered' partial cubes (although
didn't know their name until your recent post) and then eventually
realised that just because i could see structures on the screen didn't
mean that i had a hope in hell of coming up with an algorithm to
extract them :). i'm going to go away and think if there is some smart
way i could at least extract some information from some graph
structure which tells me something about how i should permute columns
to achieve 'optimal' compression.
> So, if you have some time, you can permute the columns in all possible
> ways, running each one of them through the algorithm, then
> reverse-permute the best result. Or you can randomly permute the
> columns a few times, running each through the algorithm, and pick the
> best one.
estimate something like 2.8 hours runtime in python to do 720x on
extremal datasets, but may well be fast enough in STL land.
unfortunately the underlying process is time critical and 10-20 mins
would be max acceptable runtime.
one again thanks for giving me food for thought.
| |
| Josiah Carlson 2007-05-31, 6:56 pm |
| bullockbefriending bard wrote:
>
> glad to hear you find the same thing. i don't have a CS background and
> basically don't even know what i don't know, so your confirmation is a
> relief. one day i must stop using the copy of cormen i bought from
> amazon as a doorstopper and start at page 1.
CLRS is tough to get into. Goodrich and Tamassia have a much more
readable intro to algorithms book, though it isn't nearly as complete.
>
> quite so. based on some earlier suggestions i had constructed a
> modified hamming graph linking tuples which differed in only one field
> and with the aid of graphviz 'discovered' partial cubes (although
> didn't know their name until your recent post) and then eventually
> realised that just because i could see structures on the screen didn't
> mean that i had a hope in hell of coming up with an algorithm to
> extract them :). i'm going to go away and think if there is some smart
> way i could at least extract some information from some graph
> structure which tells me something about how i should permute columns
> to achieve 'optimal' compression.
If you have a reasonable method of going from your tuples to bit
strings, using the method of "Approximate Dictionary Queries" by Gerth
Stolting Brodal and Leszek Gasieniec from 1996, you can generate all
pairs of bit strings that differ by one bit in time proportional to the
total number of bits passed.*
>
> estimate something like 2.8 hours runtime in python to do 720x on
> extremal datasets, but may well be fast enough in STL land.
> unfortunately the underlying process is time critical and 10-20 mins
> would be max acceptable runtime.
So run it for 10 minutes and choose the best result :) That would get
you a solid 40 runs or so.
Using a bit of sloppy math related to the harmonic numbers, after about
40 runs, it is probable that after choosing the first, you would have
only found better results about 5 or 6 times, and you would only get
better results from the remaining 680 possible permutations 5 or 6 times
more (basically, the chance of improving a previous result after having
tried i permutations already is 1/i, assuming a random permutation as
input).
> one again thanks for giving me food for thought.
No problem. I'm glad to help.
- Josiah
* Incidentally, I do have a method that does the same thing, is
significantly easier to understand and implement, but isn't an online
algorithm; you need to give it all of the bit vectors up front. It's
not yet published, but if you need this operation and find the
Brodal/Gasieneic algorithm to be too difficult to implement, I could
likely be convinced to help.
|
|
|
|
|