Code Comments
Programming Forum and web based access to our favorite programming groups.

hello,

on the bottom of this page
<http://mitpress.mit.edu/e-books/Hal/chap7/seven11.html> ray kurzweil
says that, when processing speech sound in order to recognise speech in
it, the sound is split into frequencies and then each frequency stream
of data is compressed "using a variety of mathematical techniques that
reduce the amount of information and emphasize those features of the
speech signal important for recognizing speech". does anyone know, or
could anyone have a good guess at, which compression he's using there?

thanks, ben.
ben wrote:
> hello,
>
> on the bottom of this page
> <http://mitpress.mit.edu/e-books/Hal/chap7/seven11.html> ray kurzweil
> says that, when processing speech sound in order to recognise speech in
> it, the sound is split into frequencies and then each frequency stream
> of data is compressed "using a variety of mathematical techniques that
> reduce the amount of information and emphasize those features of the
> speech signal important for recognizing speech". does anyone know, or
> could anyone have a good guess at, which compression he's using there?
>
> thanks, ben.

Probably lossy compression: throw out what the ear can't hear. I
presume that the information important to speech recognition is the
same information that the human ear/brain can recognize. The article
hints at how to do this: separate the input into frequency bands,
similar to what the cochlea does. The frequencies important to speech
are in the range 300 Hz to 8000 Hz, so you can discard frequencies
outside this range.

A second aspect related to compression is language modeling. Given a
speech signal s, the problem is to output text t satisfying max_t
P(t|s), the t such that probability P(t|s) is maximum. By Bayes law,
max_t P(t|s) = max_t P(s|t)P(t). P(s|t) is the acoustic model, a
mapping of text to probable vocalizations. P(t) is the language model.
Thus, if P(s|"recognize speech") = P(s|"reckon eyes peach") (they
sound the same), we can still make the correct translation because
P("recognize speech") > P("reckon eyes peach").

What the article does not mention is that determining the language
model P(t) is exactly the same problem as text compression, which is to
assign a code of length log 1/P(t) to text string t.

The coding problem has already been solved (arithmetic coding). But
finding P(t) is hard. Otherwise we would have text compressors
achieving 1 bit per character compression ratio, the entropy Shannon
estimated in 1950 by having humans try to guess successive letters in
text.

We would also have machines passing the Turing test for AI. Turing
defined AI as the ability of a machine to answer questions such that a
human judge could not tell if the answers came from the machine or
another human. If you knew the probability distribution of text from
chat sessions, in particular the distribution of possible questions
P(q) and question-answer pairs P(qa), then you could generate answers
to questions with distribution P(a|q) = P(qa)/P(q) identical to the
average human response. You can generalize this argument to
question-answer sequences of arbitrary length by treating all of the
dialog prior to the last response as q.

A whole host of problems have been informally described as "AI
complete": speech recognition, handwriting recognition, natural
language query systems, and language translation. It is believed that
if you can solve one, you can solve them all. I would add text
compression to this list. What they have in common is they require
knowing P(t).

-- Matt Mahoney
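To make the language-model argument in the post above concrete, here is a minimal sketch of choosing the text t that maximizes P(s|t)P(t), together with the ideal code length log2(1/P(t)) that a compressor would assign to each candidate. All probabilities below are made-up illustration values, and the names `acoustic`, `language`, `decode` and `code_length_bits` are hypothetical, not taken from the article or from any real recognizer.

```python
import math

# Hypothetical toy numbers, chosen only for illustration.
# Acoustic model P(s|t): both transcriptions fit the sound equally well.
acoustic = {
    "recognize speech": 0.5,
    "reckon eyes peach": 0.5,
}

# Language model P(t): one word sequence is far more plausible as English.
language = {
    "recognize speech": 1e-6,
    "reckon eyes peach": 1e-12,
}


def decode(acoustic, language):
    """Return the text t that maximizes P(s|t) * P(t)."""
    return max(acoustic, key=lambda t: acoustic[t] * language[t])


def code_length_bits(p):
    """Ideal code length, log2(1/p) bits, for a string of probability p."""
    return math.log2(1.0 / p)


best = decode(acoustic, language)
print("decoded text:", best)
for text, p in language.items():
    print(text, "->", round(code_length_bits(p), 1), "bits")
```

The same P(t) does both jobs here: it breaks the acoustic tie in favor of the likelier sentence, and it gives that sentence the shorter code. That is the sense in which a better language model is also a better text compressor.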
ben wrote:
> hello,
>
> on the bottom of this page
> <http://mitpress.mit.edu/e-books/Hal/chap7/seven11.html> ray kurzweil
> says that, when processing speech sound in order to recognise speech in
> it, the sound is split into frequencies and then each frequency stream
> of data is compressed "using a variety of mathematical techniques that
> reduce the amount of information and emphasize those features of the
> speech signal important for recognizing speech". does anyone know, or
> could anyone have a good guess at, which compression he's using there?

I believe he is referring to 'linear predictive polynomial coefficients'
(LPCs) or 'line spectral pairs' (LSPs).

> thanks, ben.

-- Phil Frisbie, Jr.
Hawk Software
http://www.hawksoft.com
Matt Mahoney wrote:
> Probably lossy compression: throw out what the ear can't hear. I
> presume that the information important to speech recognition is the
> same information that the human ear/brain can recognize.
small maybe silly point: it is possible though, just possible, that
there is useful information in speech sound that is undetectable by the
human ear -- obviously not essential information, but possibly useful
information. -- you never know. you could use that as a shortcut/cheat
compared with the human ear/brain. just a thought.
> The article
> hints at how to do this: separate the input into frequency bands,
> similar to what the cochlea does. The frequencies important to speech
> are in the range 300 Hz to 8000 Hz, so you can discard frequencies
> outside this range.
>
> A second aspect related to compression is language modeling.
i don't think the logic/processing is (in this case at least) that
involved and advanced yet -- at this stage that is? this compression
stage, after the sound has been split into various frequencies, seems
to be a pre-processing stage before the serious business gets underway
-- so just a small and early step in the whole analysis. i'm guessing
that the idea behind compressing the data first (after splitting) is to
throw the chaff out -- boil it down, reducing the sheer quantity of
data to process but without losing any (or much) info at all. which i
suppose is the essence of all compression schemes, so i now realise i
think this question pretty much comes down to (generally): what's the
best compression scheme for speech?
because i imagine the goal/point of compressing the speech data before
analysing it is to consolidate it for processing, i thought it was a
slightly different issue than normal compression -- because the
resulting compressed data is going to be further analysed i thought
normally compressed data might not lend itself so well to that, but i
now don't think that's the case. i think a good compression scheme
(efficient and makes it small) is a good compression scheme to use if
you want to further analyse the compressed data.
obviously the best speech compression scheme will make use of a
language model, but i suspect (could easily be wrong) that the
compression ray kurzweil was talking about that i quoted was not quite
of that level of compression scheme.
but, if you're in the business of speech compression obviously you'll
have a language model -- so why not make use of that language model in
this early compression stage? a dynamic, growing speech compression
scheme. but i do feel that a much less amazing speech compressor
would suffice and is being used for the stage that this question is
about.. maybe. not sure.
> Given a
> speech signal s, the problem is to output text t satisfying max_t
> P(t|s), the t such that probability P(t|s) is maximum. By Bayes law,
> max_t P(t|s) = max_t P(s|t)P(t). P(s|t) is the acoustic model, a
> mapping of text to probable vocalizations. P(t) is the language model.
> Thus, if P(s|"recognize speech") = P(s|"reckon eyes peach") (they
> sound the same), we can still make the correct translation because
> P("recognize speech") > P("reckon eyes peach").
> What the article does not mention is that determining the language
> model P(t) is exactly the same problem as text compression, which is to
> assign a code of length log 1/P(t) to text string t.
>
> The coding problem has already been solved (arithmetic coding). But
> finding P(t) is hard. Otherwise we would have text compressors
> achieving 1 bit per character compression ratio, the entropy Shannon
> estimated in 1950 by having humans try to guess successive letters in
> text.
>
> We would also have machines passing the Turing test for AI. Turing
> defined AI as the ability of a machine to answer questions such that a
> human judge could not tell if the answers came from the machine or
> another human. If you knew the probability distribution of text from
> chat sessions, in particular the distribution of possible questions
> P(q) and question-answer pairs P(qa), then you could generate answers
> to questions with distribution P(a|q) = P(qa)/P(q) identical to the
> average human response. You can generalize this argument to
> question-answer sequences of arbitrary length by treating all of the
> dialog prior to the last response as q.
>
> A whole host of problems have been informally described as "AI
> complete": speech recognition, handwriting recognition, natural
> language query systems, and language translation. It is believed that
> if you can solve one, you can solve them all. I would add text
> compression to this list. What they have in common is they require
> knowing P(t).
(just to say, you're talking about the whole speech recognition process
there, and i was specifically talking about the early, small
compression before processing stage that ray kurzweil mentioned -- not
complaining though at all, the above is really interesting and good
stuff. i think the whole speech recognition problem and the compression
that i'm asking about are two very separate things -- in this case that
is.)
(particularly regarding your last paragraph): yes, i've always thought
that the ultimate compression algorithm is the ultimate
organisational/categorising (store and search and retrieve) algorithm
is the ultimate AI algorithm is the ultimate... probably others (maybe
dsp algorithm, not sure) -- point is they're all the same single
algorithm / logic. all these different fields will have exactly the
same solution i think.
another small probably silly point: the turing test -- you could just
ask, as the tester, are you a biological human being or not? and if the
answer's yes, ask many background, history questions. if a computer was
going to pass the turing test, not only would it have to be reasonably
intelligent (which is obviously the interesting important part) but it
would have to also be able to say a very convincing, not true so far as
itself goes history (which isn't a particularly useful or interesting
goal imo (possible though) -- if you've managed to make a computer
intelligent, making it convincingly persuade people that it was born,
by a human mother, in 1967 in spain in a small village, went to such
and such school etc. (and that etc. represents a lot) would be a
complete waste of time and energy and if you had any sense you wouldn't
bother -- but you would need to bother if your intelligent machine was
going to pass the turing test).
anyway, thanks, very interesting,
ben.
Phil Frisbie, Jr. wrote:
> I believe he is referring to 'linear predictive polynomial coefficients' (LPCs)
> or 'line spectral pairs' (LSPs).

yes i had heard of one of those elsewhere in conjunction with speech
recognition before. i just looked them up in a compression book and LPC
is in there -- i did not realise that they were compression schemes. i
thought they were some kind of processing / analysis, which i suppose
they are in a way, but i didn't realise they were compression methods.

and i also now realise that any good speech compression method is a
good method to use at the stage that i'm asking about. i did think, but
not now, that because the compressed data was going to be further
analysed, that a particular compression scheme, one that lends itself
well to its data being analysed would be best to use.

thanks-a-lot,
ben.
ben wrote:
> Phil Frisbie, Jr. wrote:
> > I believe he is referring to 'linear predictive polynomial
> > coefficients' (LPCs) or 'line spectral pairs' (LSPs).
>
> yes i had heard of one of those elsewhere in conjunction with speech
> recognition before. i just looked them up in a compression book and LPC
> is in there -- i did not realise that they were compression schemes. i
> thought they were some kind of processing / analysis, which i suppose
> they are in way, but i didn't realise they were compression methods.
> and i also now realise that any good speech compression method is a
> good method to use at the stage that i'm asking about. i did think, but
> not now, that because the compressed data was going to be further
> analysed, that a particular compression scheme, one that lends itself
> well to its data being analysed would be best to use.

LPC is a way to model the human airway, and by doing so you not only get
compression but also a way to classify the speech patterns for
recognition. LPC has also been used to help the deaf learn to speak more
naturally by graphically displaying their speech patterns compared to
another person's.

> thanks-a-lot,
> ben.

-- Phil Frisbie, Jr.
Hawk Software
http://www.hawksoft.com
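For readers who want to see what LPC analysis looks like in practice, here is a minimal sketch of the textbook autocorrelation method with the Levinson-Durbin recursion. It is an illustration under stated assumptions, not what Kurzweil's system or any particular codec does: the function names `autocorrelation` and `lpc` are hypothetical, the test frame is synthetic, and real systems would window the frame and use a higher order (often around 10 for narrowband speech).

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def lpc(x, order):
    """Levinson-Durbin recursion on the autocorrelation of x.

    Returns (coeffs, err), where coeffs[k - 1] is the weight a_k in the
    prediction x[n] ~ sum_k a_k * x[n - k], and err is the energy of the
    prediction residual.
    """
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this model order.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


# Synthetic frame (an assumption for illustration): each sample is 0.9
# times the previous one, so one coefficient should explain almost all of it.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = lpc(frame, 2)
print("coefficients:", coeffs)
print("residual energy:", residual)
```

Here 200 samples are summarized by two coefficients, which should come out close to 0.9 and 0. That is the compression aspect; the coefficients also describe the shape of the filter that produced the sound, which is why they double as features for recognition.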
ben wrote:
> Matt Mahoney wrote:
> > Probably lossy compression: throw out what the ear can't hear. I
> > presume that the information important to speech recognition is the
> > same information that the human ear/brain can recognize.
>
> small maybe silly point: it is possible though, just possible, that
> there is useful information in speech sound that is undetectable by the
> human ear -- obviously not essential information, but possibly useful
> information. -- you never know. you could use that as a shortcut/cheat
> compared with the human ear/brain. just a thought.

It's possible, but the ear/brain adapts (through evolution and/or
learning) to signals that are important. A bat's hearing is most
sensitive in the narrow range used by its sonar chirps. We know that
the human ear is most sensitive to frequencies in the range used by
speech.

At a higher level, children learn at an early age to distinguish
phonemes in their native language, and it is difficult to learn this
distinction later. For example, native speakers of Chinese and Japanese
have difficulty hearing the difference between /l/ and /r/ in English.
Some languages in India have 3 distinct forms of /k/ that sound alike
to English speakers.

I suspect that what Kurzweil means by compression is feature extraction
(frequency bands, formants, phonemes), and it just happens that this
higher level representation has enough information to reconstruct
something that sounds like the original speech.

-- Matt Mahoney
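As a rough illustration of feature extraction by frequency bands, the sketch below reduces one short frame of audio to a handful of band energies. It is only a sketch under assumptions: the band edges, the 16 kHz sample rate, the 20 ms frame and the function name `band_energies` are all made up for the example, and the naive O(N^2) DFT stands in for the FFT or filter bank a real front end would use.

```python
import cmath
import math


def band_energies(samples, sample_rate, band_edges_hz):
    """Sum DFT power between consecutive band edges (naive O(N^2) DFT)."""
    n = len(samples)
    energies = [0.0] * (len(band_edges_hz) - 1)
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        power = abs(coeff) ** 2
        # Add this bin's power to whichever band contains its frequency.
        for b in range(len(energies)):
            if band_edges_hz[b] <= freq < band_edges_hz[b + 1]:
                energies[b] += power
    return energies


# Hypothetical frame: a 1 kHz tone sampled at 16 kHz for 20 ms (320 samples).
rate = 16000
frame = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(320)]

# Assumed band edges inside the 300 Hz to 8000 Hz range mentioned earlier.
edges = [300, 600, 1200, 2400, 4800, 8000]
features = band_energies(frame, rate, edges)
print(features)
```

The 320 samples collapse to 5 numbers, and nearly all of the energy lands in the 600 to 1200 Hz band that contains the tone. Anything below 300 Hz or at 8000 Hz and above is simply never counted, which matches the idea of discarding frequencies outside the speech range.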
ben wrote:
> another small probably silly point: the turing test -- you could just
> ask, as the tester, are you a biological human being or not? and if the
> answer's yes, ask many background, history questions. if a computer was
> going to pass the turing test, not only would it have to be reasonably
> intelligent (which is obviously the interesting important part) but it
> would have to also be able to say a very convincing, not true so far as
> itself goes history (which isn't a particularly useful or interesting
> goal imo (possible though) -- if you've managed to make a computer
> intelligent, making it convincingly persuade people that it was born,
> by a human mother, in 1967 in spain in a small village, went to such
> and such school etc. (and that etc. represents a lot) would be a
> complete waste of time and energy and if you had any sense you wouldn't
> bother -- but you would need to bother if your intelligent machine was
> going to pass the turing test).

That is a problem. In Turing's 1950 article in Mind, he gave a
hypothetical example of a conversation in which the interrogator gives
the subject (we don't know if it is machine or human) an arithmetic
problem, and after a 30 second delay it gives the wrong answer. I am
sure that Turing was aware that replicating human behavior is not very
useful. Nevertheless, nobody has come up with a better definition of AI
since then.

Another interesting tidbit is that Turing predicted in the same article
that a machine with 10^9 bits of memory, but no faster than the hardware
of that time, would solve the AI problem in 2000. So far nobody has won
the Loebner prize, but his prediction about the cost of memory was
remarkably accurate considering that it predated Moore's law by about
15 years.

Turing did not say how he arrived at 10^9 bits, but he did suggest a
machine learning approach, and it might be that 10^9 bits is about the
information content of all the speech and writing that you process in a
lifetime. Here is another clue. There were two important events in
1949. First, Shannon invented information theory, and second, Hebb
proposed a (now accepted) model of learning in neurons.

-- Matt Mahoney