









 

compression used to emphasise differences in sound data

ben

2005-04-20, 8:55 pm

hello,

on the bottom of this page
<http://mitpress.mit.edu/e-books/Hal/chap7/seven11.html> ray kurzweil
says that, when processing speech sound in order to recognise speech in
it, the sound is split into frequencies and then each frequency stream
of data is compressed "using a variety of mathematical techniques that
reduce the amount of information and emphasize those features of the
speech signal important for recognizing speech". does anyone know, or
could anyone have a good guess at, which compression he's using there?

thanks, ben.

Matt Mahoney

2005-04-21, 3:55 pm

ben wrote:
> hello,
>
> on the bottom of this page
> <http://mitpress.mit.edu/e-books/Hal/chap7/seven11.html> ray kurzweil
> says that, when processing speech sound in order to recognise speech in
> it, the sound is split into frequencies and then each frequency stream
> of data is compressed "using a variety of mathematical techniques that
> reduce the amount of information and emphasize those features of the
> speech signal important for recognizing speech". does anyone know, or
> could anyone have a good guess at, which compression he's using there?
>
> thanks, ben.


Probably lossy compression: throw out what the ear can't hear. I
presume that the information important to speech recognition is the
same information that the human ear/brain can recognize. The article
hints at how to do this: separate the input into frequency bands,
similar to what the cochlea does. The frequencies important to speech
are in the range 300 Hz to 8000 Hz, so you can discard frequencies
outside this range.
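
As a rough illustration of that idea (not Kurzweil's actual front end, which the article does not specify), the sketch below reduces a short frame of samples to a handful of band energies and ignores everything outside 300 Hz to 8000 Hz. The band edges, the frame length and the test signal are all assumptions chosen for the demo, and the DFT is the naive one so that nothing beyond the standard library is needed.

```python
import cmath
import math

def band_energies(frame, sample_rate, bands):
    """Return the spectral energy of `frame` inside each (low_hz, high_hz) band."""
    n = len(frame)
    energies = [0.0] * len(bands)
    # Naive DFT over the non-negative frequency bins; fine for a short demo frame.
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        power = abs(coeff) ** 2
        for b, (low, high) in enumerate(bands):
            if low <= freq < high:
                energies[b] += power
    return energies

SAMPLE_RATE = 16000  # assumed sampling rate in Hz
FRAME_LEN = 256      # assumed frame length in samples

# Test signal: a 125 Hz hum (below the speech range) plus a 1000 Hz tone.
frame = [math.sin(2 * math.pi * 125 * i / SAMPLE_RATE)
         + 0.5 * math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE)
         for i in range(FRAME_LEN)]

# Hypothetical band edges covering 300 Hz to 8000 Hz; anything outside is dropped.
BANDS = [(300, 600), (600, 1200), (1200, 2400), (2400, 4800), (4800, 8000)]

features = band_energies(frame, SAMPLE_RATE, BANDS)
for (low, high), energy in zip(BANDS, features):
    print(f"{low:5d}-{high:5d} Hz: {energy:10.2f}")
```

The 256 input samples shrink to 5 numbers, and the 125 Hz hum does not show up in any of them because no band covers it. A real recognizer would use many more bands and a faster transform.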

A second aspect related to compression is language modeling. Given a
speech signal s, the problem is to output the text t that maximizes
P(t|s). By Bayes' law, P(t|s) = P(s|t)P(t)/P(s), and P(s) does not
depend on t, so argmax_t P(t|s) = argmax_t P(s|t)P(t). P(s|t) is the
acoustic model, a mapping of text to probable vocalizations. P(t) is
the language model. Thus, if P(s|"recognize speech") = P(s|"reckon eyes
peach") (they sound the same), we can still make the correct
translation because P("recognize speech") > P("reckon eyes peach").
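
A minimal sketch of that decision rule, with made-up probabilities: the two candidate transcriptions get the same acoustic score, so the language model alone decides. The numbers and the function name `decode` are invented for illustration.

```python
import math

# Toy numbers, invented for illustration only.
acoustic = {  # P(s|t): both candidates sound the same
    "recognize speech": 0.2,
    "reckon eyes peach": 0.2,
}
language = {  # P(t): the first phrase is far more likely as text
    "recognize speech": 1e-6,
    "reckon eyes peach": 1e-12,
}

def decode(acoustic, language):
    """Return the t that maximizes log P(s|t) + log P(t)."""
    return max(acoustic,
               key=lambda t: math.log(acoustic[t]) + math.log(language[t]))

best = decode(acoustic, language)
print(best)
```

Working with sums of logs instead of products is the usual way to avoid underflow when the probabilities get very small.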

What the article does not mention is that determining the language
model P(t) is exactly the same problem as text compression, which is to
assign a code of length log 1/P(t) to text string t.
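
To make the link between P(t) and code length concrete, here is a small sketch that computes the ideal code length, the sum of log2 1/P over the characters, under two models. The order-0 character model is only a stand-in (it is fitted to the string itself and ignores the cost of transmitting the model); a real compressor would predict each character from its context.

```python
import math
from collections import Counter

def ideal_code_length_bits(text, prob):
    """Sum of log2(1/P(c)) over the characters of `text`."""
    return sum(math.log2(1.0 / prob[c]) for c in text)

text = "abracadabra"

# Model 1: every distinct character equally likely.
alphabet = sorted(set(text))
uniform = {c: 1.0 / len(alphabet) for c in alphabet}

# Model 2: order-0 frequencies fitted to the text itself.
counts = Counter(text)
fitted = {c: n / len(text) for c, n in counts.items()}

uniform_bits = ideal_code_length_bits(text, uniform)
fitted_bits = ideal_code_length_bits(text, fitted)
bits_per_char = fitted_bits / len(text)

print(f"uniform model: {uniform_bits:.2f} bits")
print(f"fitted model:  {fitted_bits:.2f} bits ({bits_per_char:.2f} per char)")
```

The model that assigns the text a higher probability yields the shorter code, which is the sense in which better modeling and better compression are the same problem.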

The coding problem has already been solved (arithmetic coding). But
finding P(t) is hard. Otherwise we would have text compressors
achieving about 1 bit per character, the entropy of English that
Shannon estimated in 1950 by having humans try to guess successive
letters in text.

We would also have machines passing the Turing test for AI. Turing
defined AI as the ability of a machine to answer questions such that a
human judge could not tell if the answers came from the machine or
another human. If you knew the probability distribution of text from
chat sessions, in particular the distribution of possible questions
P(q) and question-answer pairs P(qa), then you could generate answers
to questions with distribution P(a|q) = P(qa)/P(q) identical to the
average human response. You can generalize this argument to
question-answer sequences of arbitrary length by treating all of the
dialog prior to the last response as q.
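
The P(a|q) = P(qa)/P(q) step can be sketched with a toy table of question-answer counts. The counts and the helper names below are hypothetical; the point is only that sampling from the conditional distribution reproduces the mix of answers seen in the data.

```python
import random

# Hypothetical counts of (question, answer) pairs, standing in for chat logs.
pair_counts = {
    ("how are you?", "fine, thanks"): 6,
    ("how are you?", "not bad"): 3,
    ("how are you?", "tired"): 1,
    ("what time is it?", "no idea"): 2,
}

def answer_distribution(question, pair_counts):
    """P(a|q) = P(qa) / P(q), computed from the joint counts."""
    q_total = sum(n for (q, _), n in pair_counts.items() if q == question)
    return {a: n / q_total
            for (q, a), n in pair_counts.items() if q == question}

def sample_answer(question, pair_counts, rng=random):
    """Draw one answer with probability P(a|q); fails for unseen questions."""
    dist = answer_distribution(question, pair_counts)
    answers = list(dist)
    return rng.choices(answers, weights=[dist[a] for a in answers], k=1)[0]

dist = answer_distribution("how are you?", pair_counts)
print(dist)
print(sample_answer("how are you?", pair_counts))
```

A lookup table like this breaks down as soon as a question has not been seen before, which is exactly why estimating the distribution is the hard part.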

A whole host of problems have been informally described as "AI
complete": speech recognition, handwriting recognition, natural
language query systems, and language translation. It is believed that
if you can solve one, you can solve them all. I would add text
compression to this list. What they have in common is they require
knowing P(t).

-- Matt Mahoney

Phil Frisbie, Jr.

2005-04-21, 3:55 pm

ben wrote:

> hello,
>
> on the bottom of this page
> <http://mitpress.mit.edu/e-books/Hal/chap7/seven11.html> ray kurzweil
> says that, when processing speech sound in order to recognise speech in
> it, the sound is split into frequencies and then each frequency stream
> of data is compressed "using a variety of mathematical techniques that
> reduce the amount of information and emphasize those features of the
> speech signal important for recognizing speech". does anyone know, or
> could anyone have a good guess at, which compression he's using there?


I believe he is referring to 'linear predictive polynomial coefficients' (LPCs)
or 'line spectral pairs' (LSPs).

> thanks, ben.


--
Phil Frisbie, Jr.
Hawk Software
http://www.hawksoft.com

ben

2005-04-22, 3:56 pm

Matt Mahoney wrote:

> Probably lossy compression: throw out what the ear can't hear. I
> presume that the information important to speech recognition is the
> same information that the human ear/brain can recognize.


small maybe silly point: it is possible though, just possible, that
there is useful information in speech sound that is undetectable by the
human ear -- obviously not essential information, but possibly useful
information. -- you never know. you could use that as a shortcut/cheat
compared with the human ear/brain. just a thought.

> The article
> hints at how to do this: separate the input into frequency bands,
> similar to what the cochlea does. The frequencies important to speech
> are in the range 300 Hz to 8000 Hz, so you can discard frequencies
> outside this range.
>
> A second aspect related to compression is language modeling.


i don't think the logic/processing is (in this case at least) that
involved and advanced yet -- at this stage that is? this compression
stage, after the sound has been split into various frequencies, seems
to be a pre-processing stage before the serious business gets underway
-- so just a small and early step in the whole analysis. i'm guessing
that the idea behind compressing the data first (after splitting) is to
throw the chaff out -- boil it down, reducing the sheer quantity of
data to process but without losing any (or much) info at all. which i
suppose is the essence of all compression schemes, so i now realise i
think this question pretty much comes down to (generally): what's the
best compression scheme for speech?

because i imagine the goal/point of compressing the speech data before
analysing it is to consolidate it for processing, i thought it was a
slightly different issue than normal compression -- because the
resulting compressed data is going to be further analysed i thought
normally compressed data might not lend itself so well to that, but i
now don't think that's the case. i think a good compression scheme
(efficient and makes it small) is a good compression scheme to use if
you want to further analyse the compressed data.

obviously the best speech compression scheme will make use of a
language model, but i suspect (could easily be wrong) that the
compression ray kurzweil was talking about that i quoted was not quite
of that level of compression scheme.

but, if you're in the business of speech compression obviously you'll
have a language model -- so why not make use of that language model in
this early compression stage? a dynamic, growing speech compression
scheme. but i do feel that a much less amazing speech compressor
would suffice and is being used for the stage that this question is
about.. maybe. not sure.

> Given a
> speech signal s, the problem is to output text t satisfying max_t
> P(t|s), the t such that probability P(t|s) is maximum. By Bayes law,
> max_t P(t|s) = max_t P(s|t)P(t). P(s|t) is the acoustic model, a
> mapping of text to probable vocalizations. P(t) is the language model.
> Thus, if P(s|"recognize speech") = P(s|"reckon eyes peach") (they
> sound the same), we can still make the correct translation because
> P("recognize speech") > P("reckon eyes peach").
>
> What the article does not mention is that determining the language
> model P(t) is exactly the same problem as text compression, which is to
> assign a code of length log 1/P(t) to text string t.
>
> The coding problem has already been solved (arithmetic coding). But
> finding P(t) is hard. Otherwise we would have text compressors
> achieving 1 bit per character compression ratio, the entropy Shannon
> estimated in 1950 by having humans try to guess successive letters in
> text.
>
> We would also have machines passing the Turing test for AI. Turing
> defined AI as the ability of a machine to answer questions such that a
> human judge could not tell if the answers came from the machine or
> another human. If you knew the probability distribution of text from
> chat sessions, in particular the distribution of possible questions
> P(q) and question-answer pairs P(qa), then you could generate answers
> to questions with distribution P(a|q) = P(qa)/P(q) identical to the
> average human response. You can generalize this argument to
> question-answer sequences of arbitrary length by treating all of the
> dialog prior to the last response as q.
>
> A whole host of problems have been informally described as "AI
> complete": speech recognition, handwriting recognition, natural
> language query systems, and language translation. It is believed that
> if you can solve one, you can solve them all. I would add text
> compression to this list. What they have in common is they require
> knowing P(t).


(just to say, you're talking about the whole speech recognition process
there, and i was specifically talking about the early, small
compression before processing stage that ray kurzweil mentioned -- not
complaining though at all, the above is really interesting and good
stuff. i think the whole speech recognition problem and the compression
that i'm asking about are two very separate things -- in this case that
is.)

(particularly regarding your last paragraph): yes, i've always thought
that the ultimate compression algorithm is the ultimate
organisational/categorising (store and search and retrieve) algorithm
is the ultimate AI algorithm is the ultimate... probably others (maybe
dsp algorithm, not sure) -- point is they're all the same single
algorithm / logic. all these different fields will have exactly the
same solution i think.

another small probably silly point: the turing test -- you could just
ask, as the tester, are you a biological human being or not? and if the
answer's yes, ask many background, history questions. if a computer was
going to pass the turing test, not only would it have to be reasonably
intelligent (which is obviously the interesting important part) but it
would have to also be able to say a very convincing, not true so far as
itself goes history (which isn't a particularly useful or interesting
goal imo (possible though) -- if you've managed to make a computer
intelligent, making it convincingly persuade people that it was born,
by a human mother, in 1967 in spain in a small village, went to such
and such school etc. (and that etc. represents a lot) would be a
complete waste of time and energy and if you had any sense you wouldn't
bother -- but you would need to bother if your intelligent machine was
going to pass the turing test).

anyway, thanks, very interesting,

ben.

ben

2005-04-22, 3:56 pm

Phil Frisbie, Jr. wrote:

> I believe he is referring to 'linear predictive polynomial coefficients' (LPCs)
> or 'line spectral pairs' (LSPs).


yes i had heard of one of those elsewhere in conjunction with speech
recognition before. i just looked them up in a compression book and LPC
is in there -- i did not realise that they were compression schemes. i
thought they were some kind of processing / analysis, which i suppose
they are in a way, but i didn't realise they were compression methods.
and i also now realise that any good speech compression method is a
good method to use at the stage that i'm asking about. i did think, but
not now, that because the compressed data was going to be further
analysed, that a particular compression scheme, one that lends itself
well to its data being analysed would be best to use.

thanks-a-lot,
ben.

Phil Frisbie, Jr.

2005-04-22, 3:56 pm

ben wrote:

> Phil Frisbie, Jr. wrote:
>
> > I believe he is referring to 'linear predictive polynomial coefficients' (LPCs)
> > or 'line spectral pairs' (LSPs).
>
> yes i had heard of one of those elsewhere in conjunction with speech
> recognition before. i just looked them up in a compression book and LPC
> is in there -- i did not realise that they were compression schemes. i
> thought they were some kind of processing / analysis, which i suppose
> they are in a way, but i didn't realise they were compression methods.
> and i also now realise that any good speech compression method is a
> good method to use at the stage that i'm asking about. i did think, but
> not now, that because the compressed data was going to be further
> analysed, that a particular compression scheme, one that lends itself
> well to its data being analysed would be best to use.


LPC is a way to model the human airway, and by doing so you not only get
compression but a way to classify the speech patterns for recognition. LPC has
also been used to help the deaf to learn to speak more naturally by graphically
displaying their speech patterns compared to another person's.
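
For anyone who wants to see the idea in code, here is a minimal sketch of LPC analysis using the autocorrelation method and the Levinson-Durbin recursion. It is not taken from any particular codec or recognizer; the synthetic "vowel" (a single two-pole resonance excited by an impulse), the order of 2 and all names are assumptions made for the demo. A real analysis would window short frames of speech and use an order of around 10 or more.

```python
import math

def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x[n] * x[n + lag], for lag = 0..max_lag."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def lpc(x, order):
    """Levinson-Durbin recursion.

    Returns (a, err) where the predictor is
    x[n] ~ sum(a[k] * x[n - 1 - k] for k in range(order))
    and err is the remaining prediction error energy.
    """
    r = autocorrelation(x, order)
    a = []
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err
        # Update the lower-order coefficients, then append the new one.
        a = [a[j] - k * a[i - 1 - j] for j in range(i)] + [k]
        err *= 1.0 - k * k
    return a, err

# Synthetic "vowel": impulse response of one two-pole resonance.
R, W = 0.9, 2 * math.pi * 0.1          # pole radius and angle (assumed)
C1, C2 = 2 * R * math.cos(W), -R * R   # true predictor coefficients
signal = [1.0, C1]
for n in range(2, 400):
    signal.append(C1 * signal[-1] + C2 * signal[-2])

coeffs, residual_energy = lpc(signal, 2)
print("true coefficients:     ", [round(C1, 4), round(C2, 4)])
print("estimated coefficients:", [round(c, 4) for c in coeffs])
```

The 400 samples are summarized by two numbers that describe the resonance, which is where both the compression and the usefulness for classification come from.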

> thanks-a-lot,
> ben.


--
Phil Frisbie, Jr.
Hawk Software
http://www.hawksoft.com

Matt Mahoney

2005-04-22, 3:56 pm

ben wrote:
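> Matt Mahoney wrote:
>
> > Probably lossy compression: throw out what the ear can't hear. I
> > presume that the information important to speech recognition is the
> > same information that the human ear/brain can recognize.
>
> small maybe silly point: it is possible though, just possible, that
> there is useful information in speech sound that is undetectable by the
> human ear -- obviously not essential information, but possibly useful
> information. -- you never know. you could use that as a shortcut/cheat
> compared with the human ear/brain. just a thought.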


It's possible, but the ear/brain adapts (through evolution and/or
learning) to signals that are important. A bat's hearing is most
sensitive in the narrow range used by its sonar chirps. We know that
the human ear is most sensitive to frequencies in the range used by
speech. At a higher level, children learn at an early age to
distinguish phonemes in their native language, and it is difficult to
learn this distinction later. For example, native speakers of Chinese
and Japanese have difficulty hearing the difference between /l/ and /r/
in English. Some languages in India have 3 distinct forms of /k/ that
sound alike to English speakers.

I suspect that what Kurzweil means by compression is feature
extraction (frequency bands, formants, phonemes), and it just happens
that this higher level representation has enough information to
reconstruct something that sounds like the original speech.

-- Matt Mahoney

Matt Mahoney

2005-04-22, 8:55 pm

ben wrote:
> another small probably silly point: the turing test -- you could just
> ask, as the tester, are you a biological human being or not? and if the
> answer's yes, ask many background, history questions. if a computer was
> going to pass the turing test, not only would it have to be reasonably
> intelligent (which is obviously the interesting important part) but it
> would have to also be able to say a very convincing, not true so far as
> itself goes history (which isn't a particularly useful or interesting
> goal imo (possible though) -- if you've managed to make a computer
> intelligent, making it convincingly persuade people that it was born,
> by a human mother, in 1967 in spain in a small village, went to such
> and such school etc. (and that etc. represents a lot) would be a
> complete waste of time and energy and if you had any sense you wouldn't
> bother -- but you would need to bother if your intelligent machine was
> going to pass the turing test).


That is a problem. In Turing's 1950 article in Mind, he gave a
hypothetical example of a conversation in which the interrogator gives
the subject (we don't know if it is machine or human) an arithmetic
problem, and after a 30 second delay it gives the wrong answer. I am
sure that Turing was aware that replicating human behavior is not very
useful. Nevertheless, nobody has come up with a better definition of
AI since then.

Another interesting tidbit is that Turing predicted in the same article
that a machine with 10^9 bits of memory, but no faster than current
hardware at that time would solve the AI problem in 2000. So far
nobody has won the Loebner prize, but his prediction about the cost of
memory was remarkably accurate considering that it predated Moore's law
by about 15 years.

Turing did not say how he arrived at 10^9 bits, but he did suggest a
machine learning approach, and it might be that 10^9 bits is about the
information content of all the speech and writing that you process in a
lifetime. Here is another clue. Two important events had just happened
around 1949. First, Shannon published information theory (the paper in
1948, the book in 1949), and second, Hebb proposed a (now accepted)
model of learning in neurons.
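
As a back-of-the-envelope check on that guess, the sketch below converts 10^9 bits into hours of speech. Only the 1 bit per character figure comes from this thread; the 6 characters per word and 150 words per minute are assumptions picked for the arithmetic, not measured values.

```python
BITS = 10**9
BITS_PER_CHAR = 1.0        # Shannon's estimate mentioned earlier in the thread
CHARS_PER_WORD = 6.0       # assumption: about five letters plus a space
WORDS_PER_MINUTE = 150.0   # assumption: a conversational speaking rate

words = BITS / BITS_PER_CHAR / CHARS_PER_WORD
hours = words / WORDS_PER_MINUTE / 60.0
hours_per_day_over_20_years = hours / (20 * 365)

print(f"{words:.3g} words, {hours:.0f} hours, "
      f"{hours_per_day_over_20_years:.1f} hours/day over 20 years")
```

Under those assumptions 10^9 bits works out to a few hours of language per day over a couple of decades, so the order of magnitude is at least plausible.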

-- Matt Mahoney


Copyright 2008 codecomments.com