Home > Archive > Compression > May 2007 > normalize/compress database algorithm
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
normalize/compress database algorithm
|
|
| wilsotc@gmail.com 2007-05-10, 6:58 pm |
| I have a table with records in the format(1):
int field1, int field 2, int field 3
I would like an algorithm to convert this table to the minimum number
of records possible for the format(2):
int field1[], int field2[], int field3[]
such that records for format(1) could be derived from format(2)
for example, the records:
1,2,3
1,3,3
1,1,7
1,2,4
1,3,4
2,4,4
2,2,4
3,5,3
4,5,3
would become:
{1},{2,3},{3,4}
{1},{1},{7}
{2},{2,4},{4}
{3,4},{5},{3}
This may be an np-hard problem. This should be a generally applicable
algorithm to database conversions, but I can't seem to find anything.
Maybe I'm searching with the wrong terminology?
| |
| mensanator@aol.com 2007-05-11, 3:56 am |
| On May 10, 11:25=EF=BF=BDam, wils...@gmail.com wrote:
> I have a table with records in the format(1):
> int field1, int field 2, int field 3
>
> I would like an algorithm to convert this table to the minimum number
> of records possible for the format(2):
> int field1[], int field2[], int field3[]
>
> such that records for format(1) could be derived from format(2)
>
> for example, the records:
> 1,2,3
> 1,3,3
> 1,1,7
> 1,2,4
> 1,3,4
> 2,4,4
> 2,2,4
> 3,5,3
> 4,5,3
> would become:
> {1},{2,3},{3,4}
> {1},{1},{7}
> {2},{2,4},{4}
> {3,4},{5},{3}
>
> This may be an np-hard problem. =A0This should be a generally applicable
> algorithm to database conversions, but I can't seem to find anything.
> Maybe I'm searching with the wrong terminology?
Here's my Python attempt (with some more records added).
def is_valid(a,b,j):
# j is the one to merge,
# so other two have to be equal
if j=3D=3D0:
if a[1]=3D=3Db[1] and a[2]=3D=3Db[2]:
return True
elif j=3D=3D1:
if a[0]=3D=3Db[0] and a[2]=3D=3Db[2]:
return True
elif j=3D=3D2:
if a[1]=3D=3Db[1] and a[0]=3D=3Db[0]:
return True
return False
def merge(target,source):
dispatched =3D False
tlen =3D len(target)
i =3D 0
merge_to =3D []
merge_count =3D 0
status =3D 0
while i<tlen and not dispatched:
status =3D 0
for j,k in enumerate(source):
if k.issubset(target[i][j]):
# subset test checks if this combo
# is completely covered
status +=3D 1
if status=3D=3D0 or status=3D=3D1:
dispatched =3D False # check next one
elif status=3D=3D2: # a merge candidate...
dispatched =3D False # ...but we might not use it
merge_to.append(i) # merge candidates
elif status=3D=3D3: # it's covered, can discard
dispatched =3D True
i +=3D 1
if not dispatched:
# from list of merge candidates,
# only merge if the non-merging
# cells are identicle!
merge_did =3D False
if merge_to: # not empty means merge it (maybe)
m =3D len(merge_to)
n =3D 0
while not merge_did and n<m:
i =3D merge_to[n]
for j,k in enumerate(source): # find non-
subset
if not k.issubset(target[i][j]):
if is_valid(target[i],source,j): # valid
merge?
target[i][j] =3D target[i][j].union(k)
merge_did =3D True
merge_count +=3D 1
n +=3D 1
if not merge_did:
target.append(source)
else:
target.append(source) # otherwise append it
return merge_count
t1 =3D [[set([1]),set([2]),set([3])], \
[set([1]),set([3]),set([3])], \
[set([1]),set([1]),set([7])], \
[set([1]),set([2]),set([4])], \
[set([1]),set([3]),set([4])], \
[set([2]),set([4]),set([4])], \
[set([2]),set([2]),set([4])], \
[set([3]),set([5]),set([3])], \
[set([4]),set([5]),set([3])], \
[set([1]),set([2]),set([7])], \
[set([1]),set([3]),set([7])], \
[set([2]),set([1]),set([7])], \
[set([2]),set([2]),set([7])], \
[set([2]),set([3]),set([7])], \
]
over =3D False
phase =3D 1
while not over:
print 'phase %d\n' % (phase)
t2 =3D []
for i in t1: print i
print
merge_total =3D 0
while t1:
merges_did =3D merge(t2,t1[-1])
t1.pop()
merge_total +=3D merges_did
print 'merges: %d\n' % (merge_total)
for i in t2: print i
if merge_total=3D=3D0:
over =3D True
t1 =3D t2[:]
print '=3D'*40
phase +=3D 1
## phase 1
##
## [set([1]), set([2]), set([3])]
## [set([1]), set([3]), set([3])]
## [set([1]), set([1]), set([7])]
## [set([1]), set([2]), set([4])]
## [set([1]), set([3]), set([4])]
## [set([2]), set([4]), set([4])]
## [set([2]), set([2]), set([4])]
## [set([3]), set([5]), set([3])]
## [set([4]), set([5]), set([3])]
## [set([1]), set([2]), set([7])]
## [set([1]), set([3]), set([7])]
## [set([2]), set([1]), set([7])]
## [set([2]), set([2]), set([7])]
## [set([2]), set([3]), set([7])]
##
## merges: 8
##
## [set([2]), set([1, 2, 3]), set([7])]
## [set([1]), set([1, 2, 3]), set([7])]
## [set([3, 4]), set([5]), set([3])]
## [set([2]), set([2, 4]), set([4])]
## [set([1]), set([2, 3]), set([4])]
## [set([1]), set([2, 3]), set([3])]
## =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
3D=3D=3D=3D
## phase 2
##
## [set([2]), set([1, 2, 3]), set([7])]
## [set([1]), set([1, 2, 3]), set([7])]
## [set([3, 4]), set([5]), set([3])]
## [set([2]), set([2, 4]), set([4])]
## [set([1]), set([2, 3]), set([4])]
## [set([1]), set([2, 3]), set([3])]
##
## merges: 2
##
## [set([1]), set([2, 3]), set([3, 4])]
## [set([2]), set([2, 4]), set([4])]
## [set([3, 4]), set([5]), set([3])]
## [set([1, 2]), set([1, 2, 3]), set([7])]
## =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
3D=3D=3D=3D
## phase 3
##
## [set([1]), set([2, 3]), set([3, 4])]
## [set([2]), set([2, 4]), set([4])]
## [set([3, 4]), set([5]), set([3])]
## [set([1, 2]), set([1, 2, 3]), set([7])]
##
## merges: 0
##
## [set([1, 2]), set([1, 2, 3]), set([7])]
## [set([3, 4]), set([5]), set([3])]
## [set([2]), set([2, 4]), set([4])]
## [set([1]), set([2, 3]), set([3, 4])]
## =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
3D=3D=3D=3D
| |
| Josiah Carlson 2007-05-11, 9:55 pm |
| wilsotc@gmail.com wrote:
> I have a table with records in the format(1):
> int field1, int field 2, int field 3
>
> I would like an algorithm to convert this table to the minimum number
> of records possible for the format(2):
> int field1[], int field2[], int field3[]
>
> such that records for format(1) could be derived from format(2)
>
> for example, the records:
> 1,2,3
> 1,3,3
> 1,1,7
> 1,2,4
> 1,3,4
> 2,4,4
> 2,2,4
> 3,5,3
> 4,5,3
> would become:
> {1},{2,3},{3,4}
> {1},{1},{7}
> {2},{2,4},{4}
> {3,4},{5},{3}
>
> This may be an np-hard problem. This should be a generally applicable
> algorithm to database conversions, but I can't seem to find anything.
> Maybe I'm searching with the wrong terminology?
What you have described is exactly the problem of extracting
non-overlapping partial cubes from a x-partite graph.
That is to say, with 3 columns of data, you really have a tripartite
graph. Each column are labels of vertices in that column. When you
have a row, you are really linking a sequence of nodes from the first,
second, and third columns.
When you are trying to find your '{1},{2,3},{3,4}' patterns, you are
extracting a partial cube from the larger graph, continuing until there
are no more. The best compression you could get (I believe), is to
always choose the largest partial cubes first.
Now, recognizing a partial cube is O(n^2) time thanks to David
Eppstein's recent paper: http://arxiv.org/abs/0705.1025 , but I am
unaware of any algorithms for selecting partial cubes from a larger
graph. For that, one may need to speak with someone with more experience.
- Josiah
| |
| Josiah Carlson 2007-05-11, 9:55 pm |
| Josiah Carlson wrote:
[snip]
> What you have described is exactly the problem of extracting
> non-overlapping partial cubes from a x-partite graph.
>
> That is to say, with 3 columns of data, you really have a tripartite
> graph. Each column are labels of vertices in that column. When you
> have a row, you are really linking a sequence of nodes from the first,
> second, and third columns.
>
> When you are trying to find your '{1},{2,3},{3,4}' patterns, you are
> extracting a partial cube from the larger graph, continuing until there
> are no more. The best compression you could get (I believe), is to
> always choose the largest partial cubes first.
>
> Now, recognizing a partial cube is O(n^2) time thanks to David
> Eppstein's recent paper: http://arxiv.org/abs/0705.1025 , but I am
> unaware of any algorithms for selecting partial cubes from a larger
> graph. For that, one may need to speak with someone with more experience.
Thinking some more, only some cases are partial cubes. Other cases
could be handled pretty well by merely tossing the rows into a trie
(like you would a string), rotating by one column and doing it again,
etc. Possibly reverse... You could discover much of the redundant
structure in that way.
- Josiah
The failure of the partial cube technique is that given the three entries...
1,1,1
1,1,2
1,1,3
You would want to produce
{1},{1},{1,2,3}
But a partial cube technique would only produce each of them
individually. However, given...
1,2,3
1,2,4
1,3,3
1,3,4
A partial cube technique could produce
{1},{2,3},{3,4}
| |
| Josiah Carlson 2007-05-11, 9:55 pm |
| mensanator@aol.com wrote:
> On May 10, 11:25�am, wils...@gmail.com wrote:
>
> Here's my Python attempt (with some more records added).
Here's a variant of your method using lists...
The version of the algorithm I include below won't necessarily produce
the optimal results, but it will produce some valid result. I really
like it when these kinds of problems have really simple algorithms.
- Josiah
x =[[[1],[2],[3]],
[[1],[3],[3]],
[[1],[1],[7]],
[[1],[2],[4]],
[[1],[3],[4]],
[[2],[4],[4]],
[[2],[2],[4]],
[[3],[5],[3]],
[[4],[5],[3]]]
def foo(lolol):
#lolol for a list of lists of lists
for i in xrange(2):
for j in xrange(3):
lolol.sort()
k = 1
while k < len(lolol):
if lolol[k-1][:2] == lolol[k][:2]:
lolol[k-1][2].extend(lolol.pop(k)[2])
lolol[k-1][2].sort()
else:
k += 1
#rotate!
for k in lolol:
k.append(k.pop(0))
#reverse!
for k in lolol:
k.reverse()
return lolol
import pprint
pprint.pprint(foo(x))
[color=darkred]
[[[1], [2, 3], [3, 4]],
[[1], [1], [7]],
[[2], [2, 4], [4]],
[[3, 4], [5], [3]]][color=darkred]
| |
| wilsotc@gmail.com 2007-05-14, 9:55 pm |
| On May 10, 6:45 pm, "busy...@gmail.com" <busy...@gmail.com> wrote:[color=darkred]
> What you want is very related to semantic database compression. It's
> usually impossible to make it lossless and significantly reducing the
> size, but there is a bunch of research for lossy compression. You may
> start from reading this paper:http://dbpubs.stanford.edu:8090/pub/2001-22
>
> Hope this helps,
>
> Stashttp://www.busygin.dp.ua
>
> On May 10, 12:25 pm, wils...@gmail.com wrote:
>
>
>
>
>
While I do need the resultant records to be lossless, any lossy
techniques would be an excellent way to show anomalous data. Any
difference between the lossy compressed tables and the lossless
compressed table may be a data entry error, as the source of the flat
file was human input.
| |
| wilsotc@gmail.com 2007-05-14, 9:55 pm |
| On May 11, 2:15 am, "mensana...@aol.com" <mensana...@aol.com> wrote:
> On May 10, 11:25?am, wils...@gmail.com wrote:
>
>
>
>
>
>
>
>
> Here's my Python attempt (with some more records added).
>
> def is_valid(a,b,j):
> # j is the one to merge,
> # so other two have to be equal
> if j==0:
> if a[1]==b[1] and a[2]==b[2]:
> return True
> elif j==1:
> if a[0]==b[0] and a[2]==b[2]:
> return True
> elif j==2:
> if a[1]==b[1] and a[0]==b[0]:
> return True
> return False
>
> def merge(target,source):
> dispatched = False
> tlen = len(target)
> i = 0
> merge_to = []
> merge_count = 0
> status = 0
> while i<tlen and not dispatched:
> status = 0
> for j,k in enumerate(source):
> if k.issubset(target[i][j]):
> # subset test checks if this combo
> # is completely covered
> status += 1
> if status==0 or status==1:
> dispatched = False # check next one
> elif status==2: # a merge candidate...
> dispatched = False # ...but we might not use it
> merge_to.append(i) # merge candidates
> elif status==3: # it's covered, can discard
> dispatched = True
> i += 1
> if not dispatched:
> # from list of merge candidates,
> # only merge if the non-merging
> # cells are identicle!
> merge_did = False
> if merge_to: # not empty means merge it (maybe)
> m = len(merge_to)
> n = 0
> while not merge_did and n<m:
> i = merge_to[n]
> for j,k in enumerate(source): # find non-
> subset
> if not k.issubset(target[i][j]):
> if is_valid(target[i],source,j): # valid
> merge?
> target[i][j] = target[i][j].union(k)
> merge_did = True
> merge_count += 1
> n += 1
> if not merge_did:
> target.append(source)
> else:
> target.append(source) # otherwise append it
> return merge_count
>
> t1 = [[set([1]),set([2]),set([3])], \
> [set([1]),set([3]),set([3])], \
> [set([1]),set([1]),set([7])], \
> [set([1]),set([2]),set([4])], \
> [set([1]),set([3]),set([4])], \
> [set([2]),set([4]),set([4])], \
> [set([2]),set([2]),set([4])], \
> [set([3]),set([5]),set([3])], \
> [set([4]),set([5]),set([3])], \
> [set([1]),set([2]),set([7])], \
> [set([1]),set([3]),set([7])], \
> [set([2]),set([1]),set([7])], \
> [set([2]),set([2]),set([7])], \
> [set([2]),set([3]),set([7])], \
> ]
>
> over = False
>
> phase = 1
>
> while not over:
> print 'phase %d\n' % (phase)
> t2 = []
> for i in t1: print i
> print
> merge_total = 0
> while t1:
> merges_did = merge(t2,t1[-1])
> t1.pop()
> merge_total += merges_did
> print 'merges: %d\n' % (merge_total)
> for i in t2: print i
> if merge_total==0:
> over = True
> t1 = t2[:]
> print '='*40
> phase += 1
>
> ## phase 1
> ##
> ## [set([1]), set([2]), set([3])]
> ## [set([1]), set([3]), set([3])]
> ## [set([1]), set([1]), set([7])]
> ## [set([1]), set([2]), set([4])]
> ## [set([1]), set([3]), set([4])]
> ## [set([2]), set([4]), set([4])]
> ## [set([2]), set([2]), set([4])]
> ## [set([3]), set([5]), set([3])]
> ## [set([4]), set([5]), set([3])]
> ## [set([1]), set([2]), set([7])]
> ## [set([1]), set([3]), set([7])]
> ## [set([2]), set([1]), set([7])]
> ## [set([2]), set([2]), set([7])]
> ## [set([2]), set([3]), set([7])]
> ##
> ## merges: 8
> ##
> ## [set([2]), set([1, 2, 3]), set([7])]
> ## [set([1]), set([1, 2, 3]), set([7])]
> ## [set([3, 4]), set([5]), set([3])]
> ## [set([2]), set([2, 4]), set([4])]
> ## [set([1]), set([2, 3]), set([4])]
> ## [set([1]), set([2, 3]), set([3])]
> ## ========================================
> ## phase 2
> ##
> ## [set([2]), set([1, 2, 3]), set([7])]
> ## [set([1]), set([1, 2, 3]), set([7])]
> ## [set([3, 4]), set([5]), set([3])]
> ## [set([2]), set([2, 4]), set([4])]
> ## [set([1]), set([2, 3]), set([4])]
> ## [set([1]), set([2, 3]), set([3])]
> ##
> ## merges: 2
> ##
> ## [set([1]), set([2, 3]), set([3, 4])]
> ## [set([2]), set([2, 4]), set([4])]
> ## [set([3, 4]), set([5]), set([3])]
> ## [set([1, 2]), set([1, 2, 3]), set([7])]
> ## ========================================
> ## phase 3
> ##
> ## [set([1]), set([2, 3]), set([3, 4])]
> ## [set([2]), set([2, 4]), set([4])]
> ## [set([3, 4]), set([5]), set([3])]
> ## [set([1, 2]), set([1, 2, 3]), set([7])]
> ##
> ## merges: 0
> ##
> ## [set([1, 2]), set([1, 2, 3]), set([7])]
> ## [set([3, 4]), set([5]), set([3])]
> ## [set([2]), set([2, 4]), set([4])]
> ## [set([1]), set([2, 3]), set([3, 4])]
> ## ========================================
wow, and here I had given up on usenet! I've applied this code to my
data. 18000+ records are reduced to ~1400 in 5 "phases". It also
appears to compress by more than 50%. This code also seems to run in
something like O(n^2), because the 18000+ records with ranges of
[0-850],[0-2800],[0-15] for fields 1,2,3 respectively only take about
1 minute to finish.
| |
| wilsotc@gmail.com 2007-05-14, 9:55 pm |
| On May 14, 12:11 pm, wils...@gmail.com wrote:
> On May 11, 2:15 am, "mensana...@aol.com" <mensana...@aol.com> wrote:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> wow, and here I had given up on usenet! I've applied this code to my
> data. 18000+ records are reduced to ~1400 in 5 "phases". It also
> appears to compress by more than 50%. This code also seems to run in
> something like O(n^2), because the 18000+ records with ranges of
> [0-850],[0-2800],[0-15] for fields 1,2,3 respectively only take about
> 1 minute to finish.
To be exact, the resultant set is reduced to a size 41% of the
original for a 59%reduction. The gzip compressed file of the results
file is also 59% smaller than the gzip compressed start file. So this
algorithm improves on gzip compression by "a lot" for this type of
data at the cost of many CPU cycles, but still, isn't this generally
applicable to file compression?
| |
| wilsotc@gmail.com 2007-05-14, 9:55 pm |
| On May 14, 12:54 pm, wils...@gmail.com wrote:
> On May 14, 12:11 pm, wils...@gmail.com wrote:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> To be exact, the resultant set is reduced to a size 41% of the
> original for a 59%reduction. The gzip compressed file of the results
> file is also 59% smaller than the gzip compressed start file. So this
> algorithm improves on gzip compression by "a lot" for this type of
> data at the cost of many CPU cycles, but still, isn't this generally
> applicable to file compression?
This would be an excellent way of quantifying redundant information in
tabular data. I know that my flat table is at least 59% redundant.
Perhaps more of it is, I'm not sure the 1412 records are the minimum
number possible.
| |
| mensanator@aol.com 2007-05-14, 9:55 pm |
| On May 14, 12:20 pm, wils...@gmail.com wrote:
> On May 14, 12:54 pm, wils...@gmail.com wrote:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> This would be an excellent way of quantifying redundant information in
> tabular data. I know that my flat table is at least 59% redundant.
> Perhaps more of it is, I'm not sure the 1412 records are the minimum
> number possible
Don't assume it's minimum. The resulting target table depends
on the order of records in the source table. In the following example,
I added 5 new records [1,1,1], [2,2,2], [3,3,3], [4,4,4] and [5,5,5].
Note how the target table is different depending on whether the
new records were appended to the source or inserted at the
start of the source. in the first run, only [5,5,5] is unmerged.
In the second, none were merged although there is only 1
more record. It also affected how the other records merged
together.
I couldn't even begin to tell you how to get the optimum result,
perhaps that's where the NP-hard gets involved.
## phase 1
##
## [set([1]), set([2]), set([3])]
## [set([1]), set([3]), set([3])]
## [set([1]), set([1]), set([7])]
## [set([1]), set([2]), set([4])]
## [set([1]), set([3]), set([4])]
## [set([2]), set([4]), set([4])]
## [set([2]), set([2]), set([4])]
## [set([3]), set([5]), set([3])]
## [set([4]), set([5]), set([3])]
## [set([1]), set([2]), set([7])]
## [set([1]), set([3]), set([7])]
## [set([2]), set([1]), set([7])]
## [set([2]), set([2]), set([7])]
## [set([2]), set([3]), set([7])]
## [set([1]), set([1]), set([1])]
## [set([2]), set([2]), set([2])]
## [set([3]), set([3]), set([3])]
## [set([4]), set([4]), set([4])]
## [set([5]), set([5]), set([5])]
##
## merges: 9
##
## discards: 0
##
## [set([5]), set([5]), set([5])]
## [set([2, 4]), set([4]), set([4])]
## [set([3]), set([3, 5]), set([3])]
## [set([2]), set([2]), set([2, 4, 7])]
## [set([1]), set([1]), set([1, 7])]
## [set([2]), set([1, 3]), set([7])]
## [set([1]), set([2, 3]), set([7])]
## [set([4]), set([5]), set([3])]
## [set([1]), set([2, 3]), set([4])]
## [set([1]), set([2, 3]), set([3])]
## ========================================
## phase 2
##
## [set([5]), set([5]), set([5])]
## [set([2, 4]), set([4]), set([4])]
## [set([3]), set([3, 5]), set([3])]
## [set([2]), set([2]), set([2, 4, 7])]
## [set([1]), set([1]), set([1, 7])]
## [set([2]), set([1, 3]), set([7])]
## [set([1]), set([2, 3]), set([7])]
## [set([4]), set([5]), set([3])]
## [set([1]), set([2, 3]), set([4])]
## [set([1]), set([2, 3]), set([3])]
##
## merges: 2
##
## discards: 0
##
## [set([1]), set([2, 3]), set([3, 4, 7])]
## [set([4]), set([5]), set([3])]
## [set([2]), set([1, 3]), set([7])]
## [set([1]), set([1]), set([1, 7])]
## [set([2]), set([2]), set([2, 4, 7])]
## [set([3]), set([3, 5]), set([3])]
## [set([2, 4]), set([4]), set([4])]
## [set([5]), set([5]), set([5])]
## ========================================
## phase 3
##
## [set([1]), set([2, 3]), set([3, 4, 7])]
## [set([4]), set([5]), set([3])]
## [set([2]), set([1, 3]), set([7])]
## [set([1]), set([1]), set([1, 7])]
## [set([2]), set([2]), set([2, 4, 7])]
## [set([3]), set([3, 5]), set([3])]
## [set([2, 4]), set([4]), set([4])]
## [set([5]), set([5]), set([5])]
##
## merges: 0
##
## discards: 0
##
## [set([5]), set([5]), set([5])]
## [set([2, 4]), set([4]), set([4])]
## [set([3]), set([3, 5]), set([3])]
## [set([2]), set([2]), set([2, 4, 7])]
## [set([1]), set([1]), set([1, 7])]
## [set([2]), set([1, 3]), set([7])]
## [set([4]), set([5]), set([3])]
## [set([1]), set([2, 3]), set([3, 4, 7])]
## ========================================
## record order changes end result
## phase 1
##
## [set([1]), set([1]), set([1])]
## [set([2]), set([2]), set([2])]
## [set([3]), set([3]), set([3])]
## [set([4]), set([4]), set([4])]
## [set([5]), set([5]), set([5])]
## [set([1]), set([2]), set([3])]
## [set([1]), set([3]), set([3])]
## [set([1]), set([1]), set([7])]
## [set([1]), set([2]), set([4])]
## [set([1]), set([3]), set([4])]
## [set([2]), set([4]), set([4])]
## [set([2]), set([2]), set([4])]
## [set([3]), set([5]), set([3])]
## [set([4]), set([5]), set([3])]
## [set([1]), set([2]), set([7])]
## [set([1]), set([3]), set([7])]
## [set([2]), set([1]), set([7])]
## [set([2]), set([2]), set([7])]
## [set([2]), set([3]), set([7])]
##
## merges: 8
##
## discards: 0
##
## [set([2]), set([1, 2, 3]), set([7])]
## [set([1]), set([1, 2, 3]), set([7])]
## [set([3, 4]), set([5]), set([3])]
## [set([2]), set([2, 4]), set([4])]
## [set([1]), set([2, 3]), set([4])]
## [set([1]), set([2, 3]), set([3])]
## [set([5]), set([5]), set([5])]
## [set([4]), set([4]), set([4])]
## [set([3]), set([3]), set([3])]
## [set([2]), set([2]), set([2])]
## [set([1]), set([1]), set([1])]
## ========================================
## phase 2
##
## [set([2]), set([1, 2, 3]), set([7])]
## [set([1]), set([1, 2, 3]), set([7])]
## [set([3, 4]), set([5]), set([3])]
## [set([2]), set([2, 4]), set([4])]
## [set([1]), set([2, 3]), set([4])]
## [set([1]), set([2, 3]), set([3])]
## [set([5]), set([5]), set([5])]
## [set([4]), set([4]), set([4])]
## [set([3]), set([3]), set([3])]
## [set([2]), set([2]), set([2])]
## [set([1]), set([1]), set([1])]
##
## merges: 2
##
## discards: 0
##
## [set([1]), set([1]), set([1])]
## [set([2]), set([2]), set([2])]
## [set([3]), set([3]), set([3])]
## [set([4]), set([4]), set([4])]
## [set([5]), set([5]), set([5])]
## [set([1]), set([2, 3]), set([3, 4])]
## [set([2]), set([2, 4]), set([4])]
## [set([3, 4]), set([5]), set([3])]
## [set([1, 2]), set([1, 2, 3]), set([7])]
## ========================================
## phase 3
##
## [set([1]), set([1]), set([1])]
## [set([2]), set([2]), set([2])]
## [set([3]), set([3]), set([3])]
## [set([4]), set([4]), set([4])]
## [set([5]), set([5]), set([5])]
## [set([1]), set([2, 3]), set([3, 4])]
## [set([2]), set([2, 4]), set([4])]
## [set([3, 4]), set([5]), set([3])]
## [set([1, 2]), set([1, 2, 3]), set([7])]
##
## merges: 0
##
## discards: 0
##
## [set([1, 2]), set([1, 2, 3]), set([7])]
## [set([3, 4]), set([5]), set([3])]
## [set([2]), set([2, 4]), set([4])]
## [set([1]), set([2, 3]), set([3, 4])]
## [set([5]), set([5]), set([5])]
## [set([4]), set([4]), set([4])]
## [set([3]), set([3]), set([3])]
## [set([2]), set([2]), set([2])]
## [set([1]), set([1]), set([1])]
## ========================================
| |
| Josiah Carlson 2007-05-15, 6:55 pm |
| mensanator@aol.com wrote:
> On May 14, 12:20 pm, wils...@gmail.com wrote:
[snip][color=darkred]
>
> Don't assume it's minimum. The resulting target table depends
> on the order of records in the source table. In the following example,
> I added 5 new records [1,1,1], [2,2,2], [3,3,3], [4,4,4] and [5,5,5].
>
> Note how the target table is different depending on whether the
> new records were appended to the source or inserted at the
> start of the source. in the first run, only [5,5,5] is unmerged.
> In the second, none were merged although there is only 1
> more record. It also affected how the other records merged
> together.
>
> I couldn't even begin to tell you how to get the optimum result,
> perhaps that's where the NP-hard gets involved.
The shorter code I provided seems to be faster. It can produce
different output due to its choice of merge operations (it does on the
example that you provided earlier), but whether the output matches seems
to be data dependent.
For an example run, I produced 2000 random elements (from the range that
wilsotc provided earlier). My algorithm executed in .08 seconds and
reduced the 2000 elements to 1843. Your algorithm executed in 3.23
seconds and produced the same 1843 elements. Increasing to 5000
elements, my algorithm runs in .22 seconds, reducing the output to 4090
elements, compared to your 35 seconds and 4074 elements.
Similar things seem to occur with higher numbers of elements; your
algorithm producing generally slightly better results for a
significantly slower execution time. There may be inputs for which your
algorithm does significantly better (in time and result quality), but it
generally seems to be slower.
Note that my manipulations of the columns of data and subsequent sorting
more or less makes the 'can x be merged with y' query very fast (you
only need to check adjacent entries), at the cost of missing some merges.
Technically speaking, the sort steps my algorithm uses are O(n^1.5logn)
time in the worst case, and the removal of used entries is O(n^2) time
(with a very small constant). The O(n^2) time step in my algorithm can
be tuned to O(n) time by copying to a new output rather than
manipulating the input in-place.
- Josiah
| |
| mensanator@aol.com 2007-05-15, 6:55 pm |
| On May 15, 12:15 pm, Josiah Carlson <josiah.carl...@sbcglobal.net>
wrote:
> mensana...@aol.com wrote:
> [snip]
>
>
>
>
> The shorter code I provided seems to be faster. It can produce
> different output due to its choice of merge operations (it does on the
> example that you provided earlier), but whether the output matches seems
> to be data dependent.
>
> For an example run, I produced 2000 random elements (from the range that
> wilsotc provided earlier). My algorithm executed in .08 seconds and
> reduced the 2000 elements to 1843. Your algorithm executed in 3.23
> seconds and produced the same 1843 elements. Increasing to 5000
> elements, my algorithm runs in .22 seconds, reducing the output to 4090
> elements, compared to your 35 seconds and 4074 elements.
>
> Similar things seem to occur with higher numbers of elements; your
> algorithm producing generally slightly better results for a
> significantly slower execution time. There may be inputs for which your
> algorithm does significantly better (in time and result quality), but it
> generally seems to be slower.
There were a buch of inefficiencies in the original code I posted.
No need to make a list of candidates for merging and then repeat
the tests on the candidates. I changed it so a candidate is
immediately checked and merged if valid (no more need for a
seperate is_valid() function). That can save a lot of time.
def merge(target,source):
dispatched = False
tlen = len(target)
i = 0
merge_count = 0
discard_count = 0
while i<tlen and not dispatched:
status = 0
eq_status = 0
last_nonsubset = 0
for j,k in enumerate(source):
if k.issubset(target[i][j]):
# subset test checks if this combo
# is completely covered
status += 1
# while we're at it, see if they are also equal
if k==target[i][j]:
eq_status += 1
else:
# 'last' is 'only' when status 2
last_nonsubset = j
if status==0 or status==1:
# can't merge, check next target record
dispatched = False
elif status==2:
# a merge candidate...
# ...if the subsets are also equal
if eq_status==2:
target[i][last_nonsubset] = target[i]
[last_nonsubset].union(source[last_nonsubset])
merge_count += 1
dispatched = True
else:
dispatched = False
elif status==3:
# it's covered, can discard
dispatched = True
discard_count += 1
i += 1
if not dispatched:
# otherwise append it
target.append(source)
return (merge_count,discard_count)
t1 = [ \
[set([1]),set([1]),set([1])], \
[set([2]),set([2]),set([2])], \
[set([3]),set([3]),set([3])], \
[set([4]),set([4]),set([4])], \
[set([5]),set([5]),set([5])], \
[set([1]),set([2]),set([3])], \
[set([1]),set([3]),set([3])], \
[set([1]),set([1]),set([7])], \
[set([1]),set([2]),set([4])], \
[set([1]),set([3]),set([4])], \
[set([2]),set([4]),set([4])], \
[set([2]),set([2]),set([4])], \
[set([3]),set([5]),set([3])], \
[set([4]),set([5]),set([3])], \
[set([1]),set([2]),set([7])], \
[set([1]),set([3]),set([7])], \
[set([2]),set([1]),set([7])], \
[set([2]),set([2]),set([7])], \
[set([2]),set([3]),set([7])], \
]
over = False
phase = 1
while not over:
print 'phase %d\n' % (phase)
t2 = []
for i in t1: print i
print
merge_total = 0
discard_total = 0
while t1:
merges_did = merge(t2,t1[-1])
t1.pop()
merge_total += merges_did[0]
discard_total += merges_did[1]
print ' merges: %d\n' % (merge_total)
print 'discards: %d\n' % (discard_total)
for i in t2: print i
if merge_total==0:
over = True
t1 = t2[:]
print '='*40
phase += 1
>
> Note that my manipulations of the columns of data and subsequent sorting
> more or less makes the 'can x be merged with y' query very fast (you
> only need to check adjacent entries), at the cost of missing some merges.
>
> Technically speaking, the sort steps my algorithm uses are O(n^1.5logn)
> time in the worst case, and the removal of used entries is O(n^2) time
> (with a very small constant). The O(n^2) time step in my algorithm can
> be tuned to O(n) time by copying to a new output rather than
> manipulating the input in-place.
>
> - Josiah- Hide quoted text -
>
> - Show quoted text -
| |
| Josiah Carlson 2007-05-15, 9:55 pm |
| mensanator@aol.com wrote:
> On May 15, 12:15 pm, Josiah Carlson <josiah.carl...@sbcglobal.net>
> wrote:
[snip]
>
> There were a buch of inefficiencies in the original code I posted.
> No need to make a list of candidates for merging and then repeat
> the tests on the candidates. I changed it so a candidate is
> immediately checked and merged if valid (no more need for a
> seperate is_valid() function). That can save a lot of time.
It is about 15-20% faster than the old version (22 seconds versus 26
seconds on 4000 items, compared to .16 seconds for mine), but your
scanning through the sequence then comparing the content of the items is
really the bottleneck. If you could find some way to speed it up, I
believe your algorithm could be practical for longer sequences.
- Josiah
| |
| Adteddick 2007-05-17, 2:31 am |
| the alleged britney spears sex tape!
http://celebsvips.com/b1.jpg
Looks like the alleged Britney Spears K-fed tapes may have finally gotten into the wrong hands. Britney should try investing a tiny percentage of her money into a decent camera! Funny how a video surfaces a day after their divorce. Looks like Kevin is out for revenge!
A source told US W ly magazine: "He has threatened to release raunchy footage of the two taken before Spears looked pregnant."
This is the hottest scandal since the release of the Paris Hilton Sex Tape! Its not conclusive but damn, if thats Britney she can really wax a pole! This is what weve all been waiting for. | |
| wilsotc@gmail.com 2007-05-18, 7:55 am |
| On May 14, 1:20 pm, wils...@gmail.com wrote:
> On May 14, 12:54 pm, wils...@gmail.com wrote:
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> This would be an excellent way of quantifying redundant information in
> tabular data. I know that my flat table is at least 59% redundant.
> Perhaps more of it is, I'm not sure the 1412 records are the minimum
> number possible.
test post
| |
| wilsotc@gmail.com 2007-05-18, 6:55 pm |
| On May 15, 4:35 pm, "mensana...@aol.com" <mensana...@aol.com> wrote:[color=darkred]
> On May 15, 12:15 pm, Josiah Carlson <josiah.carl...@sbcglobal.net>
> wrote:
>
>
>
>
>
>
>
>
>
>
> There were a buch of inefficiencies in the original code I posted.
> No need to make a list of candidates for merging and then repeat
> the tests on the candidates. I changed it so a candidate is
> immediately checked and merged if valid (no more need for a
> seperate is_valid() function). That can save a lot of time.
>
> def merge(target,source):
> dispatched = False
> tlen = len(target)
> i = 0
> merge_count = 0
> discard_count = 0
> while i<tlen and not dispatched:
> status = 0
> eq_status = 0
> last_nonsubset = 0
> for j,k in enumerate(source):
> if k.issubset(target[i][j]):
> # subset test checks if this combo
> # is completely covered
> status += 1
> # while we're at it, see if they are also equal
> if k==target[i][j]:
> eq_status += 1
> else:
> # 'last' is 'only' when status 2
> last_nonsubset = j
> if status==0 or status==1:
> # can't merge, check next target record
> dispatched = False
> elif status==2:
> # a merge candidate...
> # ...if the subsets are also equal
> if eq_status==2:
> target[i][last_nonsubset] = target[i]
> [last_nonsubset].union(source[last_nonsubset])
> merge_count += 1
> dispatched = True
> else:
> dispatched = False
> elif status==3:
> # it's covered, can discard
> dispatched = True
> discard_count += 1
> i += 1
> if not dispatched:
> # otherwise append it
> target.append(source)
> return (merge_count,discard_count)
>
> t1 = [ \
> [set([1]),set([1]),set([1])], \
> [set([2]),set([2]),set([2])], \
> [set([3]),set([3]),set([3])], \
> [set([4]),set([4]),set([4])], \
> [set([5]),set([5]),set([5])], \
> [set([1]),set([2]),set([3])], \
> [set([1]),set([3]),set([3])], \
> [set([1]),set([1]),set([7])], \
> [set([1]),set([2]),set([4])], \
> [set([1]),set([3]),set([4])], \
> [set([2]),set([4]),set([4])], \
> [set([2]),set([2]),set([4])], \
> [set([3]),set([5]),set([3])], \
> [set([4]),set([5]),set([3])], \
> [set([1]),set([2]),set([7])], \
> [set([1]),set([3]),set([7])], \
> [set([2]),set([1]),set([7])], \
> [set([2]),set([2]),set([7])], \
> [set([2]),set([3]),set([7])], \
> ]
>
> over = False
>
> phase = 1
>
> while not over:
> print 'phase %d\n' % (phase)
> t2 = []
> for i in t1: print i
> print
> merge_total = 0
> discard_total = 0
> while t1:
> merges_did = merge(t2,t1[-1])
> t1.pop()
> merge_total += merges_did[0]
> discard_total += merges_did[1]
> print ' merges: %d\n' % (merge_total)
> print 'discards: %d\n' % (discard_total)
> for i in t2: print i
> if merge_total==0:
> over = True
> t1 = t2[:]
> print '='*40
> phase += 1
>
>
>
>
>
>
At this point, the algorithm merges two records if one records arrays
are contained in the other records respective arrays. This gives a
good reduction in the number of records and >50% compression. Because
each element in the arrays represent a record in the relational
tables, by reducing the number of total elements, the algorithm can be
made even more useful for database conversions and I think compression
in general.
If any two records have a two arrays where the first records arrays
are contained in the second records arrays, the elements of the
records whose sum of matching array elements is larger are moved to
the smaller records elements, and the 3rd fields elements are copied.
If the total elements of the result are smaller, the change is kept
otherwise the merge doesn't complete.
for example, these two records:
[set([254]), set([1, 2, 3, 4, 15, 17, 21]), set([2])
[set([254]), set([1, 2, 3, 4, 15, 17, 21, 1369]), set([3])
become
[set([254]), set([1, 2, 3, 4, 15, 17, 21]), set([2,3])
[set([254]), set([1369]), set([3])
Is this a novel way to do compression or has this been done a 1000
times? Again I think this is generally applicable.
I tried to post this before, but it went into limbo.
| |
| Josiah Carlson 2007-05-18, 9:55 pm |
| wilsotc@gmail.com wrote:
> On May 15, 4:35 pm, "mensana...@aol.com" <mensana...@aol.com> wrote:
[snip]
> At this point, the algorithm merges two records if one records arrays
> are contained in the other records respective arrays. This gives a
> good reduction in the number of records and >50% compression. Because
> each element in the arrays represent a record in the relational
> tables, by reducing the number of total elements, the algorithm can be
> made even more useful for database conversions and I think compression
> in general.
The kinds of compression you are seeing are a function of your data
element distributions and the size of your element universe. Depending
on the universe size and/or data distribution, one may not do as well.
(I suggest you use the algorithm I provide for any longer sequences, if
only to get an idea of expected performance...it is much faster)
The algorithm that mensanator@gmail.com provided will also dedupe
elements. That is, if you have two identical elements, one of them will
be removed. That is a lossy process, and removes the applicability in
general data compression.
Further, either algorithm necessarily destroys row orderings. If it
doesn't matter, great. If order does matter, and order isn't defined as
being some sorted version of the rows, then these algorithms will
destroy the order.
> If any two records have a two arrays where the first records arrays
> are contained in the second records arrays, the elements of the
> records whose sum of matching array elements is larger are moved to
> the smaller records elements, and the 3rd fields elements are copied.
> If the total elements of the result are smaller, the change is kept
> otherwise the merge doesn't complete.
>
> for example, these two records:
> [set([254]), set([1, 2, 3, 4, 15, 17, 21]), set([2])
> [set([254]), set([1, 2, 3, 4, 15, 17, 21, 1369]), set([3])
> become
> [set([254]), set([1, 2, 3, 4, 15, 17, 21]), set([2,3])
> [set([254]), set([1369]), set([3])
>
> Is this a novel way to do compression or has this been done a 1000
> times? Again I think this is generally applicable.
I don't know if it has been done before. But I also don't believe that
this method is generally applicable to data compression.
For certain database applications for which some fixed number of columns
have some limited and fixed range for which the ordering of the rows are
inconsequential, these methods *can* reduce the number of rows.
However, very few data compression problems are related to limited-range
unordered column-based data.
I also think you would find it quite difficult to map the compression of
a plain text file into this area without jumping through a bunch of
hoops and generally doing poorly compared to the 14 year old gzip.
On a somewhat related note, there is a database or database-like
application out there (maybe PyTables?) for which sorted indexes are
stored as pairs of (value, entry). The entry is the primary key for the
particular row with that value, but because all of those pairs are
sorted by value, one can think of the packed representation of all of
the 'value's as really columns of bytes, each column having a very
particular data distribution (highest order byte nondecreasing, very
long runs of the same values, etc.). One implementation claimed that
the values in the index could be reduced to 1/4 their size on average,
and with some tricks, have faster access time than the uncompressed data.
> I tried to post this before, but it went into limbo.
Sometimes google groups is a bit laggy.
- Josiah
|
|
|
|
|