Expression in Speech: synopsis

Katherine Morton and Mark Tatham

Expression in Speech: Natural and Synthetic
Oxford: Oxford University Press [2004]

hardback: ISBN 0199250677

www.oup.co.uk/isbn/0-19-925067-7

What is Natural Expression?

All human speech has expression. We recognise it as part of the humanness of speech, and it is a quality listeners expect to find in human communication. Without expression, speech sounds lifeless and artificial. Remove expression, and what's left is the bare bones of the intended message, but none of the feelings which surround the message.

The plain message, the bare bones of speech, is conveyed by the words and how they are arranged into sentences. The semantic content of words and the relationships between them are described by semantics and syntax - part of the discipline of linguistics. This basic message conveys information effectively. But speakers and listeners are not machines; we have feelings toward one another, and attitudes toward what we are saying and listening to.

These feelings can be so mild we say they are fairly neutral, but they can also be strong to the point of concern or indifference, humour or grief, boredom or obsession. No act of speech escapes expressive content; no message escapes being coloured by emotion or attitude.

Our sensitivity to expression is so keen that much of the time we know immediately when the emotional stance changes - say, when someone we meet and like responds to us. Or during an argument we can usually tell when someone agrees or is preparing to disagree. We can feel a person warm to us not so much by what they say, but how they say it.

This rapid adaptation is amazing, considering that the average utterance is said to be just a couple of seconds long, although much speech is so brief each utterance can be measured in milliseconds. What are we listening to? What are we detecting and perceiving? We outline some of the results of work in spoken language currently being done to address these questions.

Speech is not the only form of expression. Expressive communication can take the form of speaking, writing, singing, gardening, painting, designing buildings or stage sets, decorating the sitting room, crying, smiling, hugging a person, hurling an object, patting a dog, and so on. We say about all these activities that we are expressing ourselves.

Expression can therefore refer to a style of communication. It may be said that the purpose of expression is to communicate some aspect of our internal world to someone else. This need not be under conscious control. For example, we can reveal or betray emotion by face or body movements, or by contrastive emphasis in speech.

We can also communicate with ourselves - that is a major matter covered extensively by psychology, philosophy and neuroscience, and by novelists - and is not addressed in this book.

Can Expression be Simulated in Computer Speech?

The purpose of this book is to present the results of research into expressive content in speech, with one question in mind: can the success of human communication be simulated in human-computer communication?

Human beings communicate expressively with each other in conversation. Synthetic speech which does not have expressive content is not a true simulation of human speech, because it simulates only part of the speech: the plain message. Now, in the computer age, there is a perceived need for machines to communicate rather than simply pass information. Systems without expressive content will not be acceptable in applications of speech synthesis technology where more than just the plain message is needed.

What do we need for computer simulation?

We need to collect the right kind of data, and build suitable models that describe human expressive communication in such a way that the resulting model is computationally adequate. There are two approaches:

generalize from a large database of expressive speech, and replicate these generalizations in the sound wave;
establish which features of speech cue the listener's perception of the expressive content that the speaker, and ultimately the machine, intends or needs to convey.

The book is about the results of research seeking to define expression in speech; it is about the research attempting to develop computer speech with expression; it is also about suggesting ways forward.

We see the major problem as one of incomplete model building. Researchers still do not have an adequate model of human speech communication. Therefore, attempts to build credible-sounding machine systems will fail to some degree.

We outline the major problems that have proved difficult to resolve satisfactorily, and comment on work currently being published in the field. We point out areas of possible improvement within the current paradigm, and suggest some theoretical considerations which, if implemented, might help point toward a more successful route to serviceable computer speech.