Katherine Morton - RESEARCH PROFILE
[key words: expression, 'emotive content of spoken language', linguistics, biopsychology, psychoacoustics, 'acoustic phonetics']
RESEARCH INTEREST
What are we saying when we say 'Hello'? These two syllables, four separate sounds, form a greeting. They are an acknowledgement of meeting someone, a way of letting the other person know the speaker is open to the beginnings of a conversation.
But this word can also be spoken in a number of different ways; the
emotive content can inform the listener that the speaker is happy, bored,
in love, irritated, perplexed, in a hurry, frightened. All this
information can be conveyed by adding other words to 'hello' such as
'hello, I'm happy to see you again', but most speakers convey their
feeling and attitude by subtle alterations in the manner of speaking,
sometimes called tone-of-voice.
Thus, spoken language carries information beyond the plain message expressed by the words, syllables, and segments. My current work addresses
the question: What is this other information and how might it be described
and modeled?
BASIC INTEREST
My basic interest is in the relationship between cognitive processing and the
physical realization of the result of that processing - the translation of
cognition into action. The specific research has focused on modeling one
aspect of human language behavior - what we do when we produce and
perceive spoken language. Both speaking and understanding the message are
mediated by biological and cognitive systems.
That is, when thinking of something to say, speakers draw on their knowledge of the language and configure the vocal tract to produce the appropriate sounds; listeners detect these sounds, interpret them according to the rules of the language, and assign a meaning. Ideally, the
resulting concept for the listener is similar to the speaker's intention.
This process involves cognitive and physiological activity on the part of
both speaker and listener. These activities are associated in a coherent
way, otherwise we would not be able to communicate.
RESEARCH HISTORY - a brief statement
-
EARLY work consisted of modeling what we know about some phonological
aspects of language correlated with the physical expression of that
knowledge (i.e. moving the articulators). The work is summarized below.
This work was extended to looking at the resulting acoustic wave shape; the aim was to relate some aspects of the cognitive intention to speak, of the motor control realizing the intention, and of the resulting acoustic features.
-
This led to CURRENT work which is concerned with modelling the
communication of the emotion accompanying the plain linguistic message.
Both the linguistic message and the emotional coloring are encoded in the
speech waveform. Listeners detect and interpret the plain linguistic
message according to their knowledge of the language, whilst also being
aware of emotional overtones. This work is summarized below.
SUMMARY OF RESEARCH - a fuller statement
-
EARLY WORK: Models of spoken language production
The work was based on proposing a relationship between
-
linguistic knowledge which describes
what speakers know about their language in order to produce an
utterance and
-
descriptions of the properties of
the motor control system and configuration of the vocal tract to
produce the intended spoken language.
Experimental design centered on the question: how can linguistic knowledge be implemented within the constraints imposed by the biological system? To this end, experimental work was conducted using electromyographic (EMG), aerodynamic, and other techniques. The work led to the development of a model of motor control directed toward describing gestures associated with the linguistic intention of producing specific repeatable and unambiguous segments and syllables.
Research effort moved from looking at physiological variables to looking
at the resulting acoustic variables in the speech waveform. The model was
tested using early speech synthesis. Later, neural networks were used to
simulate production and perception. This led to the development of the PALM project (Psycho-Acoustic Language Modeling), which underlies the current work.
-
CURRENT WORK: Models of emotive spoken language
In spoken language, the basic message with the necessary linguistic
information content is conveyed by the words and their arrangement within
the sentence. However, speakers and listeners have feelings about one another and attitudes toward what is being said.
We are all aware that humans speak with emotive content, with expression
of some amount of feeling. Feelings expressed in speech may be mild, or so
strong they elicit strong feelings from the listener. Remove the emotive content, and what is left is the plain message with none of the feeling which surrounds all speech. Some researchers claim, and some
speakers report 'neutral' speech. But others maintain that neutral does in
fact convey an impression - perhaps minimal emotive content, perhaps an
awareness of disinterest or calm, but not zero emotive content.
And as a result of awareness of emotive content, we often make judgments
about others not by WHAT is said but by HOW it is said: by the tone-of-voice.
The research objective is to model those aspects of spoken language which
carry emotive content. The first stage is to determine some of the essential prosodic information that cues the listener's perception of the emotional overtones and attitudes conveyed by the speaker, and to correlate this information with the acoustic wave and with reports from listeners.
Building on the work in PALM, I am looking at variations in the acoustic
wave which appear to produce emotive perceptual effects in the listener.
The broad question is: how do we communicate on non-linguistic levels within the constraints imposed by the biological and cognitive systems? I call the approach 'Pragmatic Phonetics'.
For example, given the utterance: 'hello',
how many ways are there of saying this?
This simple utterance can sound happy, sad, irritated, sardonic, fearful,
confident and so forth. The emotive effects are achieved by manipulating
the vocal tract to produce an acoustic waveform that results in perceived
intonation, rhythm, and loudness of the syllables, words, and phrases.
Listeners also perceive subtle interactions of all three parameters.
The standard model of emotive spoken language divides the utterance into
two parts:
-
a description of the plain message described by linguistics (minimally by semantics, phonology, syntax) and
-
a description of emotive content in terms of varying intonation, increased or decreased energy on stress patterns, and increased or decreased loudness of the utterance. (The acoustic correlates are: fundamental frequency variation, amplitude changes, and timing variation.)
My approach is not to work within these two divisions but to consider the plain message as wrapped in an emotive environment. The speaker's physiological setting produces changes in the vocal tract which give rise to variations in the acoustic waveform. These changes are detected and interpreted by the listener. Emotive parameters are treated as variables, each with a range of numerical values.
Thus emotion is not added to the basic message but occurs across the
entire utterance. This approach will allow for changes in the numerical
values of the emotive variables as the utterance progresses and will deal
with changing emotive effects within a dialog. I call this Pragmatic
Phonetics.
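As a purely illustrative sketch (Python, with invented parameter names, units, and values, not drawn from the model itself), the idea can be pictured as emotive variables whose numerical values change across labelled stretches of the utterance, rather than a single emotion label attached to the whole message:
    # Illustrative sketch only: the parameter names, units and values are
    # invented, not taken from the model described here.
    from dataclasses import dataclass

    @dataclass
    class EmotiveState:
        """Numerical settings that colour a stretch of the utterance."""
        f0_range_scale: float   # widens or narrows the pitch range
        tempo_scale: float      # stretches or compresses syllable timing
        amplitude_scale: float  # overall loudness adjustment

    # The plain message is wrapped in an emotive environment whose values
    # may change as the utterance progresses, e.g. a calm start and an
    # excited end.
    utterance = [
        ("hello",        EmotiveState(f0_range_scale=1.0, tempo_scale=1.0, amplitude_scale=1.0)),
        ("you won't",    EmotiveState(f0_range_scale=1.2, tempo_scale=0.95, amplitude_scale=1.1)),
        ("believe this", EmotiveState(f0_range_scale=1.5, tempo_scale=0.85, amplitude_scale=1.3)),
    ]

    for words, state in utterance:
        print(words, state)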
MORE DETAIL ON SPOKEN LANGUAGE MODELLING
-
What does language do?
It is generally agreed there are four functions of spoken language:
-
to convey information
-
to provide explanation
-
to provide commentary
-
to express attitude and emotion.
The first three functions convey the plain message which is communicated
through written language as well as spoken language. The fourth conveys
the attitude and feeling state of the speaker, and does so mainly through speaking. The standard model claims emotive content is overlaid on the
plain message, which itself is regarded as having non-emotive content,
often labelled 'neutral' expression.
-
How do we model expressive speech?
-
The Standard Model
The problem is: how can we model emotive content associated with an abstract derived sentence? How can the output of the phonological
processing component be modified? At what point could the associated
emotive content be accounted for in spoken language?
Initially, my work was within a standard framework derived from the
multilayered traditional generative grammar approach. In this model, the
phonological component specifies the overall abstract sound pattern of the
word or sentence within each language. The speaker puts together an
abstract idealized prosodic and segmental specification of the intended
utterance - called the utterance plan. Emotive content is not part of the
linguistic model.
To construct an account of emotive effects, the standard approach is to set up two input channels: one, based on the linguistic description, conveying the text input (the plain message) output from the phonology, and a second consisting of a sequence of emotive markers synchronized with the text. Emotive effects are added to the plain message. The aim of my work, along with that of others, was to ask how observed emotive speech effects might be modeled as an extension to the non-emotive prosodic effects of intonation and rhythm found in phonological descriptions.
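A minimal sketch of that two-channel arrangement, assuming hypothetical marker names and adjustment values (Python): the text channel carries the phonological output, a parallel channel carries emotive markers synchronized with it, and the marked emotive effect is added on top of the plain message.
    # Sketch of the standard two-channel arrangement; the markers and the
    # adjustment values are hypothetical.
    text_channel = ["hello", "nice", "to", "see", "you"]
    # One emotive marker per word, synchronized with the text channel.
    emotive_channel = ["neutral", "neutral", "neutral", "happy", "happy"]

    # Emotive effects are added to the plain message: each marker maps to
    # an adjustment of the default ('neutral') prosodic values.
    adjustments = {
        "neutral": {"f0_scale": 1.0, "duration_scale": 1.0},
        "happy":   {"f0_scale": 1.2, "duration_scale": 0.9},
    }

    for word, marker in zip(text_channel, emotive_channel):
        adj = adjustments[marker]
        print(word, adj["f0_scale"], adj["duration_scale"])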
With some differences in method and the choice of basic units, my early
work in Pragmatic Phonetics followed the principles of the standard model:
modifications to the assignment of fundamental frequency and syllable duration are systematically overlaid on the 'neutral'
representation. Emotive effects are described as rule-governed departures
from neutral contours which are the default values produced by an
idealized neutral production system. Pragmatic rules were activated when
pragmatic expression markers were generated at a higher level in the
language understanding section of the conversation/discourse model.
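By way of illustration only (Python, with invented numbers), such a rule-governed departure might be sketched as follows: a 'neutral' contour supplies the default fundamental frequency and duration values, and a pragmatic rule, triggered by a marker from the higher-level discourse model, departs from those defaults.
    # Hypothetical rule: the values and the scaling factors are invented.
    neutral_f0 = [110, 130, 120, 100]             # Hz targets per syllable
    neutral_durations = [0.18, 0.22, 0.20, 0.25]  # seconds per syllable

    def apply_pragmatic_rule(f0, durations, marker):
        """Rule-governed departure from the neutral contour."""
        if marker == "irritated":
            # e.g. raise the contour and shorten the syllables
            return [v * 1.15 for v in f0], [d * 0.9 for d in durations]
        return f0, durations  # default: the idealized neutral values

    # The marker would be generated at a higher level of the
    # conversation/discourse model; here it is simply supplied.
    print(apply_pragmatic_rule(neutral_f0, neutral_durations, "irritated"))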
Data
The data was derived from human speech, in which variations in the acoustic parameters of fundamental frequency, syllable timing, and amplitude
are assumed to provide information to the listener about attitude and
emotion. These changing values become incorporated in the acoustic
realization of the phonological output of the linguistic sentence to
produce utterances intended to trigger the perceptual effect of emotion
and attitude. In common with the practice of many researchers, the model was stated in computational terms so that its coherence could be tested using a speech synthesizer.
Testing the model
Many researchers test models using a speech synthesizer to generate
sample utterances containing emotive content generated by rule.
Synthetic speech without emotive content is described as flat or
mechanical and some researchers label this output 'neutral'. To
produce a waveform which simulates speech, specific fundamental
frequency and timing values are assigned correlated with the abstract
pattern specified by phonology. The fundamental frequency values,
which the listener perceives as intonation patterns, and timing of
words and syllables are manipulated by rule to produce a perceived
effect of emotive content. (Some systems also allow manipulation of amplitude.)
-
PRAGMATIC PHONETICS - a departure from the Standard Model
It is essential to distinguish between what the speaker does, and how it
is modeled. Models by definition are abstractions; a model is a set of
generalizations that put forward hypotheses about the phenomenon being
researched. In this case, the speaker/listener and the flow of information
about emotion/attitude are being modeled.
In the research literature on emotive content of utterances, the word
'neutral' has been used in two ways: as a term in a model and as a
description of what a speaker does. As a term within the model, 'neutral'
describes an abstract non-emotive content from which other emotions have
been mapped. In the case of speaking style, 'neutral' is used to
characterize one of the results of what the speaker does - produce
non-emotive speech - and is equivalent in status to other emotive styles
such as 'happy'. Some researchers also map from 'neutral', meaning no
emotion, to the other emotions on the same level without reference to an
underlying abstract concept 'neutral'. This can be confusing.
In the approach I suggest, the pragmatic modelling of expression, the word
neutral is used in the second way, to characterize emotive content in the
utterance. Neutral is seen as an expression on the same level as other
expressions such as happy, sad, etc. Since the emotion expressions have
equivalent status, emotive content is not mapped from neutral to the other
emotion expressions. The content of the phonological plan is seen as
placed within the speaker's overall emotive stance on each occasion. There
is no mapping between or within descriptive levels. Emotive content is
described as varying within sliding scales on labelled sections of the
speech waveform.
The Pragmatic model requires syllables or words to be marked with respect
to a range of possible prosodic values and markers assigned to larger
units such as phrases. Emotion features select within the range of
acceptable possibilities and stretch or compress these ranges. The
abstract values within the range of possibilities are correlated with real-world acoustic values.
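The sketch below (Python) is only a schematic of that idea, with invented ranges, stretch factors, and frequency limits: each syllable carries a range of acceptable abstract prosodic values, an emotion feature selects within the range and can stretch or compress it, and the chosen abstract value is then correlated with a real-world acoustic value.
    # Schematic only: the ranges, the stretch factor and the frequency
    # limits are invented for illustration.

    # Each syllable is marked with a range of acceptable abstract f0 values
    # (0.0 = bottom of the speaker's range, 1.0 = top).
    syllables = {"hel": (0.3, 0.6), "lo": (0.2, 0.5)}

    def select(rng, stretch, position):
        """Stretch or compress the range, then pick a point within it."""
        low, high = rng
        mid = (low + high) / 2
        half = (high - low) / 2 * stretch
        low, high = mid - half, mid + half
        return low + position * (high - low)

    def to_hertz(abstract, floor=80.0, ceiling=260.0):
        """Correlate the abstract value with a real-world acoustic value."""
        return floor + abstract * (ceiling - floor)

    for syl, rng in syllables.items():
        abstract = select(rng, stretch=1.4, position=0.8)  # e.g. 'excited'
        print(syl, round(to_hertz(abstract), 1), "Hz")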
In this model, all spoken language is assumed to be affected by the
physiological setting of the speaker. What can be regarded as
non-emotive speech, triggering a perception of little or no emotion, is
produced by a speaker whose physiological settings are at a low level of
excitation or a high level of inhibition. Angry speaking results from
variable tension in the physiological system - the tension results in
motor control patterns that vary from the settings for, say, happy
speaking. Vocal tract configuration varies as the speaker's muscular
tension varies.
Changes in vocal tract innervation and shape appear in the speech waveform: the listener is able to detect these changes and usually interprets them adequately. Listeners sometimes
get it wrong - then the listener may ask the speaker: 'Do you intend to
sound annoyed?' or the listener may ask for corroboration 'I can tell
you've had good news', or be uncertain 'I don't think you meant to say
that, did you?', etc.
In summary, the system works like this: the speaker knows the sound
pattern of the language; this is described by linguistics (the
phonological component). In the speaker, the phonological plan for the
utterance is assembled at a cognitive level. The appropriate prosodic
effects are assigned. It is also possible to intend to express emotion in a particular way - to add, to subtract, or to repress emotive content - although generally in conversation we don't consciously plan to speak in a particular way. Well-known exceptions are newsreaders, lecturers, and speakers suppressing irritation in an argument. Whether planned or a result of
the physiological setting of the speaker, emotive effects appear in the
waveform. The listener can detect many of the intended changes, as well as
the unintended ones.
The task for the researcher is to describe these changes. Furthermore,
speakers produce utterances with ongoing variability. That is, emotion-driven changes can occur within an utterance. The speaker can begin a
sentence calmly and end it laughing, crying, or raging. This observed
ongoing variability cannot be easily modeled in the Standard Model where
emotive content is simply added to the plain message.
Testing the model
The results of the research are described by a computational model, and
tested using speech synthesis. Synthesis allows the coherence of the model to be tested and provides a stimulus set for testing listeners' responses in an experimental setting. The Pragmatic Phonetic utterance model is
evaluated by correlating the acoustic changes of fundamental frequency and
duration with perceived differences reported by panels of listeners.
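Purely as an illustration of this evaluation step, with made-up numbers (Python), the correlation might be computed along these lines: a measured acoustic change per stimulus (here, the amount of fundamental frequency variation) is set against the mean listener rating for that stimulus.
    # Illustration only: the stimuli, measurements and ratings are made up.
    from statistics import correlation  # Python 3.10+

    # Measured f0 variation (semitones) for each synthesized stimulus.
    f0_variation = [2.1, 3.4, 4.0, 5.2, 6.8]
    # Mean listener rating of perceived emotive strength (1-7 scale).
    listener_rating = [2.2, 3.1, 3.0, 4.5, 5.9]

    print(round(correlation(f0_variation, listener_rating), 2))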
[I should
emphasize that synthesis is used to test the model; this work is not
designed as an engineering implementation to improve synthetic speech.
To include a pragmatics module in a computationally adequate model for
synthesis, it is necessary to have not just an understanding of how the
acoustic signal must be made to vary, but also a formal, theoretically
based, means of triggering the varying acoustic signal at the right
moment. The theoretical underpinning is incomplete, which is
one reason why speech synthesis systems are
poor simulations of natural speech.]
THE FUTURE
Work continues on modifying the current Pragmatic Phonetic Model to
describe emotive content. Data gathering from instances of utterances
judged to have perceptible emotive content is continuing. The model
continues to be computationally adequate, and is implemented on various
speech synthesis systems to produce stimuli for testing. Work on rhythm generation, and on comparison with models of other types of emotive expression such as facial expression of emotion, has begun.
The work is closely associated with the development of Cognitive Phonetics
(M. Tatham). Tatham's work is about modelling the cognitive control of
the neuromuscular system resulting in intended variation in the acoustic
wave. The principles of Cognitive Phonetics are applicable to emotive
modelling which assumes underlying biological and cognitive components and
assumes intended and unintentional emotive effects in the utterance.
Palmyra Katherine Morton A.B. Ph.D.
email: katherine.morton@btconnect.com