Semi-OT - looking for histogram
Phil Carmody 2007-01-29, 7:55 am
I'm looking for a histogram of the initial letters of English words
in typical usage. As it seems to be fashionable nowadays, I was
wondering if Wikipedia articles, without markup, might make a decent
corpus.
Does anyone already have such data? Note - it's only the initial
letter of each word I'm interested in, so the usual letter usage
histograms aren't applicable. (Though one column of a digraph
table would do the job.)
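
For anyone wanting to build such a histogram from their own corpus, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name, the sample sentence, and the rule for what counts as a word (a run of ASCII letters and apostrophes beginning with a letter) are all made up here, and stripping Wikipedia markup is left out entirely.

```python
import re
from collections import Counter

def initial_letter_histogram(text):
    """Return {letter: fraction of words that start with that letter}."""
    # Assumption: a word is a run of ASCII letters/apostrophes starting with a letter.
    words = re.findall(r"[A-Za-z][A-Za-z']*", text)
    counts = Counter(word[0].upper() for word in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the letters from most to least frequent.
    return {letter: n / total for letter, n in counts.most_common()}

# Hypothetical sample text, only to show the call.
sample = "The quick brown fox jumps over the lazy dog; then the tortoise wins."
hist = initial_letter_histogram(sample)
for letter, fraction in hist.items():
    print(f"{letter} {fraction:.3f}")
```

The fractions are per word, so they sum to 1 over the letters that actually occur.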
Cheers,
Phil
--
"Home taping is killing big business profits. We left this side blank
so you can help." -- Dead Kennedys, written upon the B-side of tapes of
/In God We Trust, Inc./.
jdallen2000@yahoo.com 2007-02-02, 3:55 am
On Jan 29, 6:04 pm, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:
> I'm looking for a histogram of the initial letters of English words ...
No one's responded to this OT post, so
let me drive it further OT. :-)
It might be interesting to see how much this
histogram depends on the type of work. Here's
part of initial-letter histograms of five works:
The Holy Bible (Douay-Rheims Version),
Darwin's _Origin of Species_,
Shakespeare's Sonnets,
Twain's _Huckleberry Finn_,
Machiavelli's _The Prince_ (Marriott's trans.)
Bible Darwin Shake. Twain Prince
----- ----- ----- ----- -----
T .194 .152 .172 .149 .183
A .123 .112 .084 .134 .119
S .070 .078 .084 .079 .049
O .063 .096 .049 .052 .073
I .054 .078 .065 .079 .059
W .059 .051 .072 .083 .067
H .070 .035 .046 .063 .078
B .043 .054 .059 .047 .052
M .036 .037 .067 .038 .035
`And' is the most common word in the _Sonnets_,
but common words there are less common than in prose.
`Thy/Thou/Thee' contributes strongly to the Sonnets'
score on `T.' Similarly `My/Me' explains the high
`M' score in the Sonnets, while `He/His/Him' leads
to the high `H' score in the Bible and _The Prince_.
The high score on `S' for each work *except* Prince
is harder to explain. The Bible and Huck Finn get
boosts from `Say' and its variants, and Origin gets
a boost from `Species,' but these boosts are
relatively small and neither applies to the Sonnets,
which *might* get their `S' boost from a poet's
preference for the sibilant sound!
The five works above are all based on ordinary
English sentences. Another work on my machine is
the underlying data file for my genealogy. After
excluding the foreign prepositions `de' and `von',
its letter frequency ranks start EARION SLTHDU,
not too far off from the famous ETAION SHRDLU,
but its initial letter frequency ranks start
OABSMPC. (Many female names start with `M',
`B' is inflated by `By/Begat', etc.)
To make this on-topic:
Dynamic probability estimation wins!
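
One way to read that remark is as a comparison between a fixed table learned from one kind of text and counts that are updated as the symbols arrive. The Python sketch below measures both as ideal code lengths in bits. It is an illustration only: the function names and the two toy sequences of initial letters are invented here, not taken from the works above, and add-one smoothing is just one simple choice of estimator.

```python
import math
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def smoothed_model(symbols):
    """Fixed probability table from counts, with add-one smoothing."""
    counts = Counter(symbols)
    total = len(symbols) + len(ALPHABET)
    return {c: (counts[c] + 1) / total for c in ALPHABET}

def static_cost_bits(symbols, model):
    """Ideal code length of `symbols` under a fixed probability table."""
    return sum(-math.log2(model[s]) for s in symbols)

def adaptive_cost_bits(symbols):
    """Ideal code length when the counts are updated after every symbol."""
    counts = {c: 1 for c in ALPHABET}  # every letter starts with count 1
    total = len(ALPHABET)
    bits = 0.0
    for s in symbols:
        bits += -math.log2(counts[s] / total)
        counts[s] += 1
        total += 1
    return bits

# Hypothetical data: a table trained on prose-like initials, then applied
# to initials that look more like a list of names.
prose_like = "T" * 50 + "A" * 30 + "S" * 20
names_like = "O" * 50 + "M" * 30 + "B" * 20

static_bits = static_cost_bits(names_like, smoothed_model(prose_like))
adaptive_bits = adaptive_cost_bits(names_like)
print(f"static: {static_bits:.1f} bits, adaptive: {adaptive_bits:.1f} bits")
```

With a mismatch this extreme the adaptive counts need far fewer bits; on text that matches the fixed table well, the gap shrinks or reverses.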
James Dow Allen
Phil Carmody 2007-02-02, 7:56 am
jdallen2000@yahoo.com writes:
> On Jan 29, 6:04 pm, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
> wrote:
>
....
> Bible Darwin Shake. Twain Prince
> ----- ----- ----- ----- -----
> T .194 .152 .172 .149 .183
> A .123 .112 .084 .134 .119
> S .070 .078 .084 .079 .049
> O .063 .096 .049 .052 .073
> I .054 .078 .065 .079 .059
> W .059 .051 .072 .083 .067
> H .070 .035 .046 .063 .078
> B .043 .054 .059 .047 .052
> M .036 .037 .067 .038 .035
Can I have the whole tables, please?
I've found a Dickens one. (These are just the zeroth column of a digram table)
T=0.025998,
A=0.022940,
S=0.015709,
O=0.011031,
I=0.017355,
W=0.014617,
H=0.014539,
B=0.008593,
M=0.013583,
The differences are enormous.
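
The nine Dickens values sum to about 0.144, whereas each column quoted above sums to well over 0.7, so part of the gap may be a difference in normalisation rather than in the texts. The Python sketch below shows the idea under an assumption that is not confirmed anywhere in this thread: that the digram table holds joint probabilities over all character pairs, with space as symbol zero. The function names and the sample sentence are hypothetical.

```python
from collections import Counter

def digram_counts(text):
    """Count adjacent character pairs after folding text to A-Z plus space."""
    folded = "".join(c if c.isascii() and c.isalpha() else " " for c in text.upper())
    # Collapse runs of spaces and pad both ends so every word follows a space.
    folded = " " + " ".join(folded.split()) + " "
    return Counter(zip(folded, folded[1:]))

def space_row(pairs):
    """Joint probabilities P(space, letter) over all character pairs."""
    total = sum(pairs.values())
    return {b: n / total for (a, b), n in pairs.items() if a == " " and b != " "}

def per_word(row):
    """Renormalise the space row so it sums to 1, giving per-word fractions."""
    s = sum(row.values())
    return {letter: p / s for letter, p in row.items()}

# Hypothetical sample, only to show the two normalisations side by side.
pairs = digram_counts("the cat sat on the mat")
joint = space_row(pairs)
fractions = per_word(joint)
print(f"joint P(space,T) = {joint['T']:.4f}, per-word P(T) = {fractions['T']:.4f}")
```

If the assumption holds, dividing each Dickens value by the total probability of the whole space row (all 26 letters, not just the nine listed) would put it on the same per-word scale as the table above.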
Phil