Home > Archive > Compression > November 2005 > qsort in BWT
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
|
|
|
| Hi,
In BWT.CPP in the article
http://www.dogma.net/markn/articles/bwt/bwt.htm, the author does a
straight up memcmp while qsorting the rotations... he does not consider
the wrapped portion of the sorted strings. Wont this lead to a problem
in the output of the last column L?
For example, if the original string is fabababab, then consider the
following two rotations:
S1: ababfabab AND
S2: abfababab
S2 is lexically greater than S1, but if you do a straight memcmp the
way it is done in the article, then S2 will be less than S1.
Thanks
Lyle
| |
| Willem 2005-11-20, 6:55 pm |
| Lyle wrote:
) Hi,
) In BWT.CPP in the article
) http://www.dogma.net/markn/articles/bwt/bwt.htm, the author does a
) straight up memcmp while qsorting the rotations... he does not consider
) the wrapped portion of the sorted strings. Wont this lead to a problem
) in the output of the last column L?
)
) For example, if the original string is fabababab, then consider the
) following two rotations:
)
) S1: ababfabab AND
) S2: abfababab
)
) S2 is lexically greater than S1, but if you do a straight memcmp the
) way it is done in the article, then S2 will be less than S1.
He probably assumes a virtual EOF at the end of the buffer.
I assume the memcmp won't access memory beyond the end of the buffer ?
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
|
| He is passing in the length of the smaller sequence of bytes to
memcmp... so memcmp isnt stomping past the buffer, but it will fail to
see a byte that could have altered the sort order...
| |
| 3rdtry@willets.org 2005-11-20, 6:55 pm |
|
Lyle wrote:
> He is passing in the length of the smaller sequence of bytes to
> memcmp... so memcmp isnt stomping past the buffer, but it will fail to
> see a byte that could have altered the sort order...
That's equivalent to using a terminating metacharacter. If both
strings match up to the end of the shorter one, then pick the shorter
one as the smallest. That's equivalent to saying there's a magic
terminator $ on the end of the shorter string, where $ < (char) 0-255.
This metacharacter can't be represented as a byte, so it's a real pain
to program. L vectors often have a separate integer saying where the
$ character is.
This makes the BWT equivalent to suffix sorting; the rotated portion
doesn't matter, since there are never any ties past $, and $ never
lines up with itself when comparing different suffixes/rotations.
| |
|
| What about as in my example...
If the original string is fabababab$, then the two rotations:
S1: ababab$fab and S2: ab$fababab will compare differently...
According to BWT, ab$fababab is lexically greater than ababab$fab...
| |
|
| And just to add... I agree there are never any ties past the $, but the
sort ordering will depend on the rotation past the $. Why are we
discarding that sort order information... doesent the BWT depend on
that?
| |
| Willem 2005-11-20, 6:55 pm |
| Lyle wrote:
) What about as in my example...
) If the original string is fabababab$, then the two rotations:
)
) S1: ababab$fab and S2: ab$fababab will compare differently...
) According to BWT, ab$fababab is lexically greater than ababab$fab...
Remember, $ is a character smaller than all other characters.
So ab$fananan is lexically smaller than ababab$fab.
More generally, ab$<anything> is lexically smaller than ab<anything>.
So, with the addition of the EOF character, you can always stop comparing
before there is a wraparound.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
|
| Hi Willem... comsider the follwing string YXX
With the magic EOF, we have YXX$
The sort matrix is
$YXX
X$YX
XX$Y
YXX$
And so L = XXY
But if you dont consider a magic EOF, then the sort matrix will be
XXY
XYX
YXX
And L will be YXX which is different...
Thats my confusion...
| |
| Willem 2005-11-20, 6:55 pm |
| Lyle wrote:
) Hi Willem... comsider the follwing string YXX
) With the magic EOF, we have YXX$
)
) The sort matrix is
) $YXX
) X$YX
) XX$Y
) YXX$
) And so L = XXY
)
) But if you dont consider a magic EOF, then the sort matrix will be
) XXY
) XYX
) YXX
) And L will be YXX which is different...
)
) Thats my confusion...
Of course they will be different, there is a (virtual) $ in there after
all. The first L will not be XXY, it will be XXY$
For example, in the case of XYX$, you get:
$XYX
X$XY
XYX$
YX$X
So L = XY$X
This works by outputting XYX and then storing the position of the $.
Note that in the original BWT you have to store the position of the
first character, but in this case, it follows the $, so you can get
away with just storing the position of the $ and work from there.
This is, incidentally, how bwc works. Take a look at the unbwt routine.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
|
| OK... thats what I wanted to confirm... basically we are all saying "if
you consider a virtual EOF, then the last column L will be different
than not having concidered the EOF and comparing the rotations as well"
If thats the case, I agree... now my second question would be, wont we
be effecting how MTF and RLE stages past BWT operate in doing that? Or
is it the case that it doesent matter how the last column L is
constructed with respect to the EOF and that the performance of the
overal compression is the same regardless of sort ordering?
Thanks
Lyle
| |
| Willem 2005-11-20, 6:55 pm |
| Lyle wrote:
) OK... thats what I wanted to confirm... basically we are all saying "if
) you consider a virtual EOF, then the last column L will be different
) than not having concidered the EOF and comparing the rotations as well"
)
) If thats the case, I agree... now my second question would be, wont we
) be effecting how MTF and RLE stages past BWT operate in doing that? Or
) is it the case that it doesent matter how the last column L is
) constructed with respect to the EOF and that the performance of the
) overal compression is the same regardless of sort ordering?
If you use wraparound, then the beginning of the string will be used as
part of the context of the end of the string. If you don't, then it won't.
So it depends on if the beginning of the string, seen as context, fits the
end of the string. I would guess offhand it won't, so using a virtual EOF
would tend to improve compression. I'd guess by half a byte or maybe a
whole byte per block. Maybe less. Dunno.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
| |
| Phil Carmody 2005-11-20, 6:55 pm |
| "Lyle" <x_coder@hotmail.com> writes:
> He is passing in the length of the smaller sequence of bytes to
> memcmp... so memcmp isnt stomping past the buffer, but it will fail to
> see a byte that could have altered the sort order...
Nope, he's (probably) assuming that there's a virtual character
beyond the last real character which compares greater than any
character in the string. Therefore all comparisons can terminate
thereupon.
Phil
--
If a religion is defined to be a system of ideas that contains unprovable
statements, then Godel taught us that mathematics is not only a religion, it
is the only religion that can prove itself to be one. -- John Barrow
|
|
|
|
|