For Programmers: Free Programming Magazines  


Home > Archive > Compression > October 2006 > New top result on large text benchmark









You are viewing an archived Text-only version of the thread. To view this thread in it's original format and/or if you want to reply to this thread please [click here]

 

Author New top result on large text benchmark
Matt Mahoney

2006-10-16, 6:55 pm

durilca4linux_2 by Dmitry Shkarin has taken the top spot on the large
text benchmark.
http://cs.fit.edu/~mmahoney/compression/text.html

This is not an entry for the Hutter prize. The top spot on the smaller
enwik8 file is still held by Alexander Ratushnyak for paq8hp5 (which
will be official when the source code is released on Oct. 25 to satisfy
the GPL license). Also, it exceeds memory limits for the Hutter prize.

durilca4linux_2 is a PPM compressor based on ppmonstr with dictionary
preprocessing specialized for this benchmark, which is allowed under
the rules. It includes a warning that use on other files may cause
data loss. It runs under Linux on an Athlon-64 with 3 GB memory (2 GB
for enwik8). There are options to trade off memory vs. compression.
Compression time is reported about 80 minutes on enwik9 (vs. about 18
hours for paq8hp5 on my 2.2 GHz Athlon-64 with 2 GB memory). I have
not been able to verify enwik9 due to disk thrashing.

Both paq8hp5 and durilca4linux use dictionary preprocessing, where the
dictionaries group words with similar similar suffixes and similar
syntactic and semantic roles. There is no word on how this
organization was done. For paq8hp5 the organization was done manually
with the help of utilities to sort on suffixes. (I am experimenting
with techniques to build such dictionaries automatically).

These results leave unresolved which is the better compression
algorithm, PPM or context mixing (CM). CM is top ranked on most
benchmarks. However, it appears that PPM is faster and more memory
efficient, which is important for larger files where memory is tightly
constrained, and because it leads to faster development and tuning.
However, it is easier to model natural language syntax and semantics
using context mixing. With CM and a properly organized dictionary, it
is possible to discard the low bits of dictionary codes to achieve this
modeling throughout the prediction pipeline. With PPM, this modeling
is restricted to postprocessing steps such as secondary symbol
estimation (SSE).

-- Matt Mahoney

Rudi Cilibrasi

2006-10-17, 3:55 am

Thanks for this update.

I think CM has much greater potential to be worked on by more than one
person due to the enormously decoupled and flexible context mixing
system. It's much easier to iteratively "add intelligence" to CM style
programs based on my experience and I don't think I would have tried
modifying a PPM program in a way that I thought could possibly be
useful for anything other than this benchmark. Just wayyy too
complicated. Best regards,

Rudi

Matt Mahoney wrote:
> durilca4linux_2 by Dmitry Shkarin has taken the top spot on the large
> text benchmark.
> http://cs.fit.edu/~mmahoney/compression/text.html
>
> This is not an entry for the Hutter prize. The top spot on the smaller
> enwik8 file is still held by Alexander Ratushnyak for paq8hp5 (which
> will be official when the source code is released on Oct. 25 to satisfy
> the GPL license). Also, it exceeds memory limits for the Hutter prize.
>
> durilca4linux_2 is a PPM compressor based on ppmonstr with dictionary
> preprocessing specialized for this benchmark, which is allowed under
> the rules. It includes a warning that use on other files may cause
> data loss. It runs under Linux on an Athlon-64 with 3 GB memory (2 GB
> for enwik8). There are options to trade off memory vs. compression.
> Compression time is reported about 80 minutes on enwik9 (vs. about 18
> hours for paq8hp5 on my 2.2 GHz Athlon-64 with 2 GB memory). I have
> not been able to verify enwik9 due to disk thrashing.
>
> Both paq8hp5 and durilca4linux use dictionary preprocessing, where the
> dictionaries group words with similar similar suffixes and similar
> syntactic and semantic roles. There is no word on how this
> organization was done. For paq8hp5 the organization was done manually
> with the help of utilities to sort on suffixes. (I am experimenting
> with techniques to build such dictionaries automatically).
>
> These results leave unresolved which is the better compression
> algorithm, PPM or context mixing (CM). CM is top ranked on most
> benchmarks. However, it appears that PPM is faster and more memory
> efficient, which is important for larger files where memory is tightly
> constrained, and because it leads to faster development and tuning.
> However, it is easier to model natural language syntax and semantics
> using context mixing. With CM and a properly organized dictionary, it
> is possible to discard the low bits of dictionary codes to achieve this
> modeling throughout the prediction pipeline. With PPM, this modeling
> is restricted to postprocessing steps such as secondary symbol
> estimation (SSE).
>
> -- Matt Mahoney


James A. Bowery

2006-10-17, 6:55 pm

I noticed you don't have a column for the bits per character. This
entry achieves about 1.1 bits per character (1.09470512). At what
point would you say the intelligence threshold is reached?

--------
md5sum of a program I wrote over the last several days:
3d02ceaaa7c2e487def29e428f39516a -

Phil Carmody

2006-10-17, 6:55 pm

"James A. Bowery" <jabowery@gmail.com> writes:
> I noticed you don't have a column for the bits per character. This
> entry achieves about 1.1 bits per character (1.09470512). At what
> point would you say the intelligence threshold is reached?


How can you be sure you aren't measuring the stupidity of Wikipedia
rather than the intelligence of the compression algorithm?

Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
michael

2006-10-17, 6:55 pm



On Oct 17, 8:01 pm, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:
> "James A. Bowery" <jabow...@gmail.com> writes:
>
> rather than the intelligence of the compression algorithm?
>
> Phil
> --
> "Home taping is killing big business profits. We left this side blank
> so you can help." -- Dead Kennedys, written upon the B-side of tapes of
> /In God We Trust, Inc./.


Perhaps modeling stupidity is harder ... like drawing like a child.

- Michael

Phil Carmody

2006-10-17, 6:55 pm

"michael" <michael@michael-maniscalco.com> writes:
> On Oct 17, 8:01 pm, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
> wrote:
[color=darkred]
> Perhaps modeling stupidity is harder ... like drawing like a child.


I think the particular variant of stupidity is that of predictability
and unoriginality, rather than an avoidance of commonly accepted
convention. So perhaps it's more like painting a Mondrian (particularly
if trained with a corpus of Malevic's wartime output) than a Klee.

(Though oddly, having said this, my personal tastes favour Mondrian
and disfavour Klee.)

Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
Matt Mahoney

2006-10-17, 6:55 pm

James A. Bowery wrote:
> I noticed you don't have a column for the bits per character. This
> entry achieves about 1.1 bits per character (1.09470512). At what
> point would you say the intelligence threshold is reached?


It hasn't been measured for this data set, but I think it is safe to
say we are not there yet.

Shannon's estimate of 0.6 to 1.3 bpc is based on human subjects
predicting written English from a 27 character alphabet (A-Z and space)
with the help of a dictionary and character trigram frequency tables.
The subject guesses until correct, then moves to the next letter. The
uncertainty is because different probability distributions (and
therefore different entropies) can lead to the same sequence of
observed guesses.

Cover and King tried to remove this uncertainty by having subjects
assign probabilities directly in a gambing game. Their result was 1.3
bpc but I don't believe it. People are bad at estimating numeric
probabilities, otherwise we would not buy lottery tickets and
insurance. In psycological experiments, people tend to assign roughly
equal probabilities to all possible outcomes regardless of actual
likelihood, which would artificially raise the entropy. Also, their
result is based on just one sentence, which took 5 hours per person to
measure, and there was no mention in the paper about allowing
statistcal assistance as Shannon did.

In 1999 I did some simulations of the Shannon game with simple models
and concluded that the actual result of Shannon's experiments were
about in the middle of the range, about 1 bpc. But interpolating this
result to sophisticated models is probably a stretch.
http://cs.fit.edu/~mmahoney/dissertation/entropy1.html

Even if we knew this number, we would still have to do measurements on
the actual data set. This is different than the text used by Shannon
and by Cover and King. The Wikipedia text is about 75% English text
and 25% markup, XML, tables, links, etc. The text includes
capitalization, punctuation, and numbers. There is also a highly
compressable region in the middle of enwik9 (about 15% of the file)
that appears to be a collection of synthetic articles generated from a
census table. Even removing this section, the entropy of the text is
hardly uniform.

I think it is possible to improve on the Shannon game using a form of
human assisted compression competition as a multiplayer online game.
The player who compresses a string the smallest wins. It would work
like the Shannon game, except that the computer first presents the
probability distribution as a pie chart using a simple n-gram model,
and the players can make adjustments based on their high level
knowledge.

Until this is done, we won't know when a compressor achieves AI (in the
sense of a Turing test). Even then, it is still a subjective test
because different people will produce different measurements. The test
is fully objective only when comparing one compressor to another.

But I am pretty sure we are not there. paq8hp5 and durilca are doing
some syntactic and semantic modeling by virtue of grouping related
words in the dictionary. But this does not model context free grammar,
and does not model variant forms of words (e.g. rotates, rotation,
rotating) which appear in the dictionaries as independent entries. It
is not possible to capture all of the syntactic and semantic relations
between words in a one dimensional ordering at fixed intervals. The
space has hundreds of dimensions.

So I am confident there is room for improvement. However, speed and
memory constraints are significant. These programs have to sacrifice
compression to achieve these limits. Even in simple n-gram models, we
see significant gains just by allowing more time and memory. The human
brain has vastly more computational power than the machines we are
trying to do the same thing on. Language modeling has to be
computationally intensive. If it wasn't, we probably would have solved
the problem long ago.

-- Matt Mahoney

Dmitry Shkarin

2006-10-18, 3:55 am

Hello, Matt!

MM> Both paq8hp5 and durilca4linux use dictionary preprocessing, where the
MM> dictionaries group words with similar similar suffixes and similar
MM> syntactic and semantic roles. There is no word on how this
MM> organization was done. For paq8hp5 the organization was done manually
MM> with the help of utilities to sort on suffixes. (I am experimenting
MM> with techniques to build such dictionaries automatically).

Sorting 80000 words by hands? No, it is not for me. ;-)
Words in dictionary are groupped in such way that words with equal
higher byte of index of word in dictionary are encountered in equal contexts
in original text. Words inside group are sorted by suffix for better
compression of dictionary. There is no syntactic and semantic
considerations.


Ian Parker

2006-10-18, 6:55 pm


James A. Bowery wrote:
> I noticed you don't have a column for the bits per character. This
> entry achieves about 1.1 bits per character (1.09470512). At what
> point would you say the intelligence threshold is reached?
>
> --------
> md5sum of a program I wrote over the last several days:
> 3d02ceaaa7c2e487def29e428f39516a -


I would say where you are predicting the next word, such that errors
are not made in translation. "primavera" must come higher in
probability than "ressorte" or "mamanthal"

- La estacion de resorte

- Ian Parker

BTW - Entropy content is clearly Log(N!/n!m!o!p!..etc) where n,m,o,p
are first, second, third and fourth choices. Clearly Figure of Merit is
higher with more guesses near the top. I have used the Gosper
approximation for n! to calculate actual entropies.

Matt Mahoney

2006-10-18, 6:55 pm

Dmitry Shkarin wrote:
> Hello, Matt!
>
> MM> Both paq8hp5 and durilca4linux use dictionary preprocessing, where the
> MM> dictionaries group words with similar similar suffixes and similar
> MM> syntactic and semantic roles. There is no word on how this
> MM> organization was done. For paq8hp5 the organization was done manually
> MM> with the help of utilities to sort on suffixes. (I am experimenting
> MM> with techniques to build such dictionaries automatically).
>
> Sorting 80000 words by hands? No, it is not for me. ;-)
> Words in dictionary are groupped in such way that words with equal
> higher byte of index of word in dictionary are encountered in equal contexts
> in original text. Words inside group are sorted by suffix for better
> compression of dictionary. There is no syntactic and semantic
> considerations.


I am doing the same type of experiments, and it does really group
related words. Here is a small snip of a 32K word dictionary
automatically generated from enwik8. Words near each other often have
related meanings. On a larger scale, groups of words have related
syntactic roles (e.g. adjectives or nouns).

w
job
woman
row
teacher
brilliant
permanent
nationwide
false
charity
hidden
silent
beautiful
mysterious
human
scientific
trade
medical
social
economic
financial
spiritual
political
religious
cultural
linguistic
personal
private
commercial
literary
food
hunting
oil
manufacturing
dry
weather
personality
harmony
industrial
agricultural
urban
tropical
atmospheric
ambient
Midway
securities
mental
noble
moral
communal
weak
nose
priest
ritual
vocal
minimal
marginal
tragic

The dictionary is generated by collecting the 32K most frequent words
and assigning a 256 element context vector to each one. The context
vector is the sum of hashes of surrounding words weighed inversely by
distance, using different hash functions for preceding and following
contexts. Then I normalize the vectors to unit length and use bottom
up clustering to form a binary tree. For each word, I pair it with the
word with the closest context vector from the remaining words not
already taken. Then I compute the context vector of the parent node by
averaging of the child vectors (weighted by frequency), normalize, and
pair them again. Repeat until just a root node remains. The
dictionary is the output of a depth first traversal of the tree.

This allows modeling on the high bits of the dictionary codes using
both order-n models for syntax and distant bigram (sparse) models for
semantics. The advantage of this method is it is fairly fast. The
divantage is that you can't fully capture the syntactic and semantic
relations between words in a one dimensional ordering. The model does
not have access to the original context vectors. The list sometimes
produces groupings of unrelated words because there are no good choices
remaining.

-- Matt Mahoney

cwenner

2006-10-20, 3:55 am


Ben Rudiak-Gould wrote:
> Matt Mahoney wrote:
>
> I've heard this argument before, and I don't buy it. Here's what I wrote
> about it last time:
>
> http://groups.google.com/group/sci....d289ca266d45ce5
> Message-ID: <dgeitc$42q$1@gemini.csx.cam.ac.uk>
>
> And that goes for insurance too. People buy insurance to make their future
> net worth more predictable, not to maximize it.


Russel&Norvig, or perhaps it was "Learning bayesian nets" gave the same
phenomena as an example of the non-linearity of monetary resources.
They had a pretty name for this difference which I do not recall, but
the argument is as such that the expected utility of the eventualities
(i.e. P(Ev) EU[Ev]) is greater than that of cost. Reducing your savings
from $11000 to $10000 might be less of an issue than the risk of having
your house burnt down, reducing the situation to the brink of ruin.
Though this might only constitute a drop of some $80000 in a one to a
100 shot, costing you at average $800. $1000 is easy to replace in your
current situation while $800 when you can barely stay alive surely
isn't. The utility of the monetary resources isn't linear and follow
more of an S-curve and would render an insurance rational and the
behaviour mutually beneficial. I'm not sure it would explain gambling
from a stable situation however. The emperical studies they quoted
implied that the relative value in gaining resources drops as the prior
grows, if it was a large positive number (i.e. the curve is below the
line for large amounts, $1000 isn't as much worth when you have $1000
as when you have $10000). In a desperate situation, gambling against
the odds could be reasonable they argued, it might not be the case for
stable situations and previously mentioned explanations or
irrationality could be viable.

>
>
> Now that I do buy.
>
> -- Ben


Jim Leonard

2006-10-23, 6:55 pm

Matt Mahoney wrote:
> It includes a warning that use on other files may cause
> data loss. It runs under Linux on an Athlon-64 with 3 GB memory


Why do the rules allow such inflexibility? The 3GB memory I can
understand, but use on other files causing data loss -- why is that
allowed? (Is it allowed because it isn't specifically DISallowed?)

makc.the.great@gmail.com

2006-10-30, 6:55 pm


Matt Mahoney wrote:
> durilca4linux_2 by Dmitry Shkarin has taken the top spot on the large
> text benchmark.
> http://cs.fit.edu/~mmahoney/compression/text.html


btw, "durilka" in russian means something like "fraud".

bybell@rocketmail.com

2006-10-30, 6:55 pm

Matt Mahoney wrote:

> The list sometimes
> produces groupings of unrelated words because there are no good choices
> remaining.


Matt,

In order to guess closer groupings across thousands of entries,
couldn't you do something like analyze a word list for language which
uses a synthesizing written system like Hanzi or Kanji and filter
through the English equivalents? For example:

=E5=A5=B3 [jyu5/neoi5/neoi6] woman, girl; feminine; rad. 38
=E5=A5=B3=E4=B8=BB=E4=BA=BA [neoi5 zyu2 jan4] hostess; mistress
=E5=A5=B3=E4=BA=BA [neoi5 jan4*2] woman; girl
=E5=A5=B3=E4=BA=BA=E5=91=B3 [neoi5 jan4*2 mei6] feminine
=E5=A5=B3=E4=BA=BA=E8=A1=97 [neoi5 jan4*2 gaai1] lady's market
=E5=A5=B3=E4=BB=94 [neoi5 zai2] girl
=E5=A5=B3=E4=BB=95=E5=85=A7=E8=A1=A3 [neoi5 si6 noi6 ji1] lingerie
=E5=A5=B3=E4=BD=A3 [neoi5 jung6*2] (female) maid
=E5=A5=B3=E4=BF=AE=E9=81=93 [neoi5 sau1 dou6] convent
=E5=A5=B3=E5=84=AA [neoi5 jau1] an actress
=E5=A5=B3=E5=85=92 [neoi5 ji4] daughter
=E5=A5=B3=E5=A3=AB [neoi5 si6] lady; Ms

..=2E.based on =E5=A5=B3 you would have the above English words placed closer
together in context as they have the same head character in Cantonese
Chinese. At the finer level, one could extract for example the
left-hand semantic component (radical) although that doesn't always
work.

-Tony

[sorry if the unicode utf-8 characters mess with your browser.]

Matt Mahoney

2006-10-30, 6:55 pm


makc.the.great@gmail.com wrote:
> Matt Mahoney wrote:
>
> btw, "durilka" in russian means something like "fraud".


I bet that's deliberate (like my BARF compressor). I understand the
English version is Dirty Useless Really Illusory Compressor/Archiver.
Maybe Dmitry can explain...

-- Matt Mahoney

Matt Mahoney

2006-10-30, 6:55 pm

Jim Leonard wrote:
> Matt Mahoney wrote:
>
> Why do the rules allow such inflexibility? The 3GB memory I can
> understand, but use on other files causing data loss -- why is that
> allowed? (Is it allowed because it isn't specifically DISallowed?)


I designed the large text benchmark with the goal of solving the AI
problem. (That is also the goal of the Hutter prize, but they are
different contests because we could never agree on the technical
details of the rules). In fact, the rules (of both) specifically allow
compressors that don't work on all files. If we had such a rule it
would be easy to get around it by performing some trivial compression
(such as copying) on every input except the benchmark. In fact, the
rules only require a decompressor (as in the Calgary challenge).
http://cs.fit.edu/~mmahoney/compression/textrules.html

One important difference in the rules is that I did not put any hard
limits on speed or memory. I think these are necessary when prize
money is at stake, but I also think that good solutions to the language
modeling problem are going to be computationally expensive (since all
of the cheap solutions have already been tried and failed).

The other difference, of course, is the size of the data set. The
average person is exposed to the equivalent of 100 MB of text (in
spoken form) by about age 3, and 1 GB (spoken or written) as an adult.
I would expect a similar level of sophistication in the resulting
language model used by the decompressor. I was aiming higher, but
accomplishing either of these goals would be an impressive achievement.

I explain the rationale behind the design of the benchmark in more
detail in
http://cs.fit.edu/~mmahoney/compression/rationale.html

-- Matt Mahoney

Matt Mahoney

2006-10-30, 6:55 pm


bybell@rocketmail.com wrote:
> Matt Mahoney wrote:
>
>
> Matt,
>
> In order to guess closer groupings across thousands of entries,
> couldn't you do something like analyze a word list for language which
> uses a synthesizing written system like Hanzi or Kanji and filter
> through the English equivalents? For example:
>
> =E5=A5=B3 [jyu5/neoi5/neoi6] woman, girl; feminine; rad. 38
> =E5=A5=B3=E4=B8=BB=E4=BA=BA [neoi5 zyu2 jan4] hostess; mistress
> =E5=A5=B3=E4=BA=BA [neoi5 jan4*2] woman; girl
> =E5=A5=B3=E4=BA=BA=E5=91=B3 [neoi5 jan4*2 mei6] feminine
> =E5=A5=B3=E4=BA=BA=E8=A1=97 [neoi5 jan4*2 gaai1] lady's market
> =E5=A5=B3=E4=BB=94 [neoi5 zai2] girl
> =E5=A5=B3=E4=BB=95=E5=85=A7=E8=A1=A3 [neoi5 si6 noi6 ji1] lingerie
> =E5=A5=B3=E4=BD=A3 [neoi5 jung6*2] (female) maid
> =E5=A5=B3=E4=BF=AE=E9=81=93 [neoi5 sau1 dou6] convent
> =E5=A5=B3=E5=84=AA [neoi5 jau1] an actress
> =E5=A5=B3=E5=85=92 [neoi5 ji4] daughter
> =E5=A5=B3=E5=A3=AB [neoi5 si6] lady; Ms
>
> ...based on =E5=A5=B3 you would have the above English words placed closer
> together in context as they have the same head character in Cantonese
> Chinese. At the finer level, one could extract for example the
> left-hand semantic component (radical) although that doesn't always
> work.
>
> -Tony
>
> [sorry if the unicode utf-8 characters mess with your browser.]


That's very interesting. (BTW the Chinese characters display correctly
in Google/Firefox/WinXP). I didn't know that Chinese had that kind of
structure.

I guess I could also find related words using an online dictionary or
WordNet. But I think the best dictionary I could obtain for any data
set would have to come directly from that data.

-- Matt Mahoney

Phil Carmody

2006-10-30, 6:55 pm

"Matt Mahoney" <matmahoney@yahoo.com> writes:
> bybell@rocketmail.com wrote:
[color=darkred]
> I guess I could also find related words using an online dictionary or
> WordNet. But I think the best dictionary I could obtain for any data
> set would have to come directly from that data.


Online dictionaries are rubbish. Why not use a much larger
resource which reflects language as it is used rather than as
formalised or recorded by academics? There's one called
Wikipedia which I hear is a pretty huge such resource...


;-)
Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
Sponsored Links







Also available: Server administration forum archive | Web Design forum archive | Software forum archive | Hardware reviews archive

Copyright 2008 codecomments.com