Kate Morton's research

Katherine Morton and Mark Tatham

Katherine Morton - RESEARCH PROFILE

[key words: expression, 'emotive content of spoken language', linguistics, biopsychology, psychoacoustics, 'acoustic phonetics']

RESEARCH INTEREST

What are we saying when we say 'Hello'? These two syllables, 4 separate sounds, are a greeting. They are an acknowledgement of meeting someone, of letting the other person know the speaker is open to the beginnings of a conversation.

But this word also can be spoken in a number of different ways; the emotive content can inform the listener that the speaker is happy, bored, in love, irritated, perplexed, in a hurry, frightened. All this information can be conveyed by adding other words to 'hello' such as 'hello, I'm happy to see you again', but most speakers convey their feeling and attitude by subtle alterations in the manner of speaking, sometimes called tone-of-voice.

Thus, spoken language contains other information than the plain message expressed by the words, syllables, and segments. My current work addresses the question: What is this other information and how might it be described and modeled?

BASIC INTEREST

My basic interest is in the relationship between cognitive processing and the physical realization of the result of that processing - the translation of cognition into action. The specific research has focused on modeling one aspect of human language behavior - what we do when we produce and perceive spoken language. Both speaking and understanding the message are mediated by biological and cognitive systems.

That is: when speakers think of something to say, they draw on their knowledge of the language and configure the vocal tract to produce the appropriate sounds; listeners detect these sounds, interpret them according to the rules of the language, and assign a meaning. Ideally, the resulting concept for the listener is similar to the speaker's intention. This process involves cognitive and physiological activity on the part of both speaker and listener. These activities are associated in a coherent way, otherwise we would not be able to communicate.

RESEARCH HISTORY - a brief statement

EARLY work consisted of modeling what we know about some phonological aspects of language correlated with the physical expression of that knowledge (i.e. moving the articulators). The work is summarized below.

This work was extended to looking at the resulting acoustic wave shape; the aim was to relate some aspects of the cognitive intention to speak, of motor control realizing the intention, and of the resulting acoustic features.

This led to CURRENT work which is concerned with modelling the communication of the emotion accompanying the plain linguistic message. Both the linguistic message and the emotional coloring are encoded in the speech waveform. Listeners detect and interpret the plain linguistic message according to their knowledge of the language, whilst also being aware of emotional overtones. This work is summarized below.

SUMMARY OF RESEARCH - a fuller statement

EARLY WORK: Models of spoken language production

The work was based on proposing a relationship between

linguistic knowledge which describes what speakers know about their language in order to produce an utterance and

descriptions of the properties of the motor control system and configuration of the vocal tract to produce the intended spoken language.

Experimental design centered on the question type: how can linguistic knowledge be implemented within the constraints imposed by the biological system? To this end, experimental work was conducted using EMG, aerodynamic, and other techniques. The work led to development of a model of motor control directed toward describing gestures associated with the linguistic intention of producing specific repeatable and unambiguous segments and syllables.

Research effort moved from looking at physiological variables to looking at the resulting acoustic variables in the speech waveform. The model was tested using early speech synthesis. Later, neural networks were used to simulate production and perception. This led into development of the PALM project (Psycho-Acoustic Language Modeling) which underlies the current work.

CURRENT WORK: Models of emotive spoken language

In spoken language, the basic message with the necessary linguistic information content is conveyed by the words and their arrangement within the sentence. However, speakers and listeners have feelings about one another, and speakers have attitudes about what is being said and attitudes toward the listener.

We are all aware that humans speak with emotive content, with expression of some amount of feeling. Feelings expressed in speech may be mild, or so strong they elicit strong feelings from the listener. That is, remove emotive content, and what's left is the plain message with none of the feelings which surround all speech. Some researchers claim and some speakers report 'neutral' speech. But others maintain that neutral does in fact convey an impression - perhaps minimal emotive content, perhaps an awareness of disinterest or calm, but not zero emotive content.

And as a result of awareness of emotive content, we often make judgments about others not by WHAT is said but HOW it is said: by the tone-of-voice.

The research objective is to model those aspects of spoken language which carry emotive content. The first stage is to determine some of the essential prosodic information that cues perception in the listener of the emotional overtones and attitudes conveyed by the speaker and correlate this information with the acoustic wave and with reports from listeners.

Building on the work in PALM, I am looking at variations in the acoustic wave which appear to produce emotive perceptual effects in the listener. The broad question is: how do we communicate on non-linguistic levels within the constraints imposed by the biological and cognitive system? I call the approach 'Pragmatic Phonetics'.

For example, given the utterance: 'hello', how many ways are there of saying this?

This simple utterance can sound happy, sad, irritated, sardonic, fearful, confident and so forth. The emotive effects are achieved by manipulating the vocal tract to produce an acoustic waveform that results in perceived intonation, rhythm, and loudness of the syllables, words, and phrases. Listeners also perceive subtle interactions of all three parameters.

The standard model of emotive spoken language divides the utterance into two parts:

a description of the plain message described by linguistics (minimally by semantics, phonology, syntax) and

a description of emotive content in terms of varying intonation, increased or decreased energy on stress patterns, and increased or decreased loudness of the utterance. (The acoustic correlates are: fundamental frequency variation, amplitude changes, timing changes, timing variation.)

My approach is not to work within these two divisions but to consider the plain message as wrapped in an emotive environment. The speaker's physiological setting produces changes in the vocal tract which give rise to variations in the acoustic waveform. These changes are detected and interpreted by the listener. Emotive parameters are treated as variables, each with a range of numerical values.

Thus emotion is not added to the basic message but occurs across the entire utterance. This approach will allow for changes in the numerical values of the emotive variables as the utterance progresses and will deal with changing emotive effects within a dialog. I call this Pragmatic Phonetics.

MORE DETAIL ON SPOKEN LANGUAGE MODELLING

What does language do?

It is generally agreed there are four functions of spoken language:

to convey information

to provide explanation

to provide commentary

to express attitude and emotion.

The first three functions convey the plain message which is communicated through written language as well as spoken language. The fourth conveys the attitude and feeling state of the speaker, and is communicated mainly through speaking. The standard model claims emotive content is overlaid on the plain message, which itself is regarded as having non-emotive content, often labelled 'neutral' expression.

How do we model expressive speech?

The Standard Model

The problem is: how can we model emotive content associated with an abstract derived sentence? How can the output of the phonological processing component be modified? At what point could the associated emotive content be accounted for in spoken language?

Initially, my work was within a standard framework derived from the multilayered traditional generative grammar approach. In this model, the phonological component specifies the overall abstract sound pattern of the word or sentence within each language. The speaker puts together an abstract idealized prosodic and segmental specification of the intended utterance - called the utterance plan. Emotive content is not part of the linguistic model.

To construct an account of emotive effects, the standard approach is to set up two input channels: one, based on the linguistic description, conveying text input (the plain message) output from the phonology, and a second channel consisting of a sequence of emotive markers synchronized with the text. Emotive effects are added to the plain message. The aim of my work, along with that of others, was to ask how observed emotive speech effects might be modeled as an extension to the non-emotive prosodic effects of intonation and rhythm found in phonological descriptions.

With some differences in method and the choice of basic units, my early work in Pragmatic Phonetics followed the principles of the standard model: modifications to the assignment of fundamental frequency changes and duration of syllables are systematically overlaid on the 'neutral' representation. Emotive effects are described as rule-governed departures from neutral contours which are the default values produced by an idealized neutral production system. Pragmatic rules were activated when pragmatic expression markers were generated at a higher level in the language understanding section of the conversation/discourse model.
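
The overlay principle described above can be illustrated with a minimal sketch. This is not the implementation used in the research: the names Syllable, OVERLAY_RULES and overlay, and every numerical value, are hypothetical and chosen only to show rule-governed departures from a 'neutral' contour, triggered by an expression marker.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Syllable:
    text: str
    f0_hz: float        # fundamental frequency in the 'neutral' rendering
    duration_ms: float  # syllable duration in the 'neutral' rendering


# Hypothetical rule table: each expression marker maps to multiplicative
# departures from the neutral values. The numbers are illustrative only.
OVERLAY_RULES = {
    "neutral": {"f0_scale": 1.0, "duration_scale": 1.0},
    "happy": {"f0_scale": 1.2, "duration_scale": 0.9},
    "sad": {"f0_scale": 0.9, "duration_scale": 1.25},
}


def overlay(neutral_plan, marker):
    """Overlay one expression marker on every syllable of a neutral plan."""
    rule = OVERLAY_RULES[marker]
    return [
        replace(
            syllable,
            f0_hz=syllable.f0_hz * rule["f0_scale"],
            duration_ms=syllable.duration_ms * rule["duration_scale"],
        )
        for syllable in neutral_plan
    ]


# 'hello' as two syllables with made-up neutral values.
hello = [Syllable("he", 120.0, 150.0), Syllable("llo", 110.0, 250.0)]
happy_hello = overlay(hello, "happy")
```

Note that the marker applies uniformly to the whole utterance; the sketch therefore shares the limitation of the standard model that emotive content is simply added to the plain message.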

Data
The data was derived from human speech, in which variations in the acoustic parameters fundamental frequency, syllable timing and amplitude are assumed to provide information to the listener about attitude and emotion. These changing values become incorporated in the acoustic realization of the phonological output of the linguistic sentence to produce utterances intended to trigger the perceptual effect of emotion and attitude. As in the work of many researchers, the model was stated in computational terms so that its coherence could be tested using a speech synthesizer.

Testing the model
Many researchers test models using a speech synthesizer to generate sample utterances containing emotive content generated by rule. Synthetic speech without emotive content is described as flat or mechanical and some researchers label this output 'neutral'. To produce a waveform which simulates speech, specific fundamental frequency and timing values are assigned, correlated with the abstract pattern specified by phonology. The fundamental frequency values, which the listener perceives as intonation patterns, and timing of words and syllables are manipulated by rule to produce a perceived effect of emotive content. (Some systems also allow manipulation of amplitude.)

PRAGMATIC PHONETICS - a departure from the Standard Model

It is essential to distinguish between what the speaker does, and how it is modeled. Models by definition are abstractions; a model is a set of generalizations that put forward hypotheses about the phenomenon being researched. In this case, the speaker/listener and the flow of information about emotion/attitude are being modeled.

In the research literature on emotive content of utterances, the word 'neutral' has been used two ways: as a term in a model and as a description of what a speaker does. As a term within the model, 'neutral' describes an abstract non-emotive content from which other emotions have been mapped. In the case of speaking style, 'neutral' is used to characterize one of the results of what the speaker does - produce non-emotive speech - and is equivalent in status to other emotive styles such as 'happy'. Some researchers also map from 'neutral', meaning no emotion, to the other emotions on the same level without reference to an underlying abstract concept 'neutral'. This can be confusing.

In the approach I suggest, the pragmatic modelling of expression, the word neutral is used in the second way, to characterize emotive content in the utterance. Neutral is seen as an expression on the same level as other expressions such as happy, sad, etc. Since the emotion expressions have equivalent status, emotive content is not mapped from neutral to the other emotion expressions. The content of the phonological plan is seen as placed within the speaker's overall emotive stance on each occasion. There is no mapping between or within descriptive levels. Emotive content is described as varying within sliding scales on labelled sections of the speech waveform.

The Pragmatic model requires syllables or words to be marked with respect to a range of possible prosodic values, and markers to be assigned to larger units such as phrases. Emotion features select within the range of acceptable possibilities and stretch or compress these ranges. The abstract values within the range of possibilities are correlated with real-world acoustic values.
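
The idea of selecting within a range, and of stretching or compressing that range, might be sketched as follows. The class and function names (ProsodicRange, EmotionFeature, realize) are hypothetical, as are the choice of stretching about the midpoint and all numerical values; the sketch is an illustration under those assumptions, not a statement of the model itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodicRange:
    """Acceptable range of one prosodic value for a syllable or word."""
    low: float
    high: float

    def stretch(self, factor):
        """Widen (factor > 1) or compress (factor < 1) about the midpoint."""
        if factor <= 0:
            raise ValueError("factor must be positive")
        mid = (self.low + self.high) / 2
        half_width = (self.high - self.low) / 2 * factor
        return ProsodicRange(mid - half_width, mid + half_width)

    def select(self, position):
        """Pick a value at a relative position: 0.0 is low, 1.0 is high."""
        if not 0.0 <= position <= 1.0:
            raise ValueError("position must lie between 0.0 and 1.0")
        return self.low + position * (self.high - self.low)


@dataclass(frozen=True)
class EmotionFeature:
    """Hypothetical emotion feature: reshapes a range, then selects in it."""
    range_factor: float  # stretches or compresses the range
    position: float      # where in the reshaped range the value falls


def realize(value_range, feature):
    """Map an abstract range plus an emotion feature to one acoustic value."""
    return value_range.stretch(feature.range_factor).select(feature.position)


# Illustrative only: an F0 range of 100-140 Hz for one syllable.
f0_range = ProsodicRange(100.0, 140.0)
excited = EmotionFeature(range_factor=1.5, position=0.75)
subdued = EmotionFeature(range_factor=0.5, position=0.5)
excited_f0 = realize(f0_range, excited)
subdued_f0 = realize(f0_range, subdued)
```

In this sketch no expression is derived from another: each feature acts directly on the marked range, so 'neutral' would simply be one more EmotionFeature on the same level as the others.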

In this model, all spoken language is assumed to be affected by the physiological setting of the speaker. What can be regarded as non-emotive speech, triggering a perception of little or no emotion, is produced by a speaker whose physiological settings are at a low level of excitation or a high level of inhibition. Angry speaking results from variable tension in the physiological system - the tension results in motor control patterns that vary from the settings for, say, happy speaking. Vocal tract configuration varies as the speaker's muscular tension varies.

Changes in vocal tract innervation and shape result in changes which appear in the speech waveform: the listener is able to detect these changes and usually interprets them adequately. Listeners sometimes get it wrong - then the listener may ask the speaker: 'Do you intend to sound annoyed?' or the listener may ask for corroboration 'I can tell you've had good news', or be uncertain 'I don't think you meant to say that, did you?', etc.

In summary, the system works like this: the speaker knows the sound pattern of the language; this is described by linguistics (the phonological component). In the speaker, the phonological plan for the utterance is assembled at a cognitive level. The appropriate prosodic effects are assigned. It is possible to also intend to express emotion in a particular way - to add, to subtract, to repress emotive content, although generally in conversation, we don't consciously plan to speak in a particular way. Well-known exceptions are newsreaders, lecturers, and speakers suppressing irritation in an argument. Whether planned or as a result of the physiological setting of the speaker, emotive effects appear in the waveform. The listener can detect many of the intended changes, as well as the unintended ones.

The task for the researcher is to describe these changes. Furthermore, speakers produce utterances with ongoing variability. That is, emotive dominated changes can occur within an utterance. The speaker can begin a sentence calmly and end it laughing, crying, or raging. This observed ongoing variability cannot be easily modeled in the Standard Model where emotive content is simply added to the plain message.
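
One way to picture ongoing variability is to let the emotive variables take a different value on each syllable. The sketch below is only an illustration: the function name interpolate_states, the variable names f0_scale and duration_scale, the numerical values, and the use of simple linear interpolation are all assumptions, not part of the model as described.

```python
def interpolate_states(start, end, n_syllables):
    """Return one set of emotive variable values per syllable.

    Values move linearly from `start` (first syllable) to `end`
    (last syllable). Both arguments are dicts with the same keys.
    """
    if n_syllables < 1:
        raise ValueError("an utterance needs at least one syllable")
    if start.keys() != end.keys():
        raise ValueError("start and end must describe the same variables")
    states = []
    for index in range(n_syllables):
        # Relative position in the utterance, from 0.0 to 1.0.
        t = index / (n_syllables - 1) if n_syllables > 1 else 0.0
        states.append(
            {name: start[name] + t * (end[name] - start[name]) for name in start}
        )
    return states


# Illustrative only: a five-syllable sentence begun calmly and ended raging.
calm = {"f0_scale": 1.0, "duration_scale": 1.0}
raging = {"f0_scale": 1.4, "duration_scale": 0.8}
trajectory = interpolate_states(calm, raging, 5)
```

Because the values are attached to positions in the utterance rather than to the utterance as a whole, a change of expression partway through a sentence needs no special treatment.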

Testing the model
The results of the research are described by a computational model, and tested using speech synthesis. Synthesis allows testing the coherence of the model and provides a stimulus set for testing listeners' responses in an experimental setting. The Pragmatic Phonetic utterance model is evaluated by correlating the acoustic changes of fundamental frequency and duration with perceived differences reported by panels of listeners.
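
The kind of correlation involved can be sketched with a plain Pearson coefficient. The choice of Pearson's r is an assumption made for illustration, and the figures below are invented, not experimental results; the variable names f0_shift_hz and mean_rating are likewise hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)


# Invented illustration data: one entry per synthesized stimulus.
# f0_shift_hz: change in fundamental frequency applied to the stimulus.
# mean_rating: mean listener-panel rating of perceived emotive strength.
f0_shift_hz = [0.0, 10.0, 20.0, 30.0, 40.0]
mean_rating = [1.0, 2.0, 2.5, 4.0, 4.5]
r = pearson(f0_shift_hz, mean_rating)
```

A value of r close to 1 or -1 would indicate that the manipulated acoustic parameter tracks the listeners' reports closely; a value near 0 would suggest it does not.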

[I should emphasize that synthesis is used to test the model; this work is not designed as an engineering implementation to improve synthetic speech. To include a pragmatics module in a computationally adequate model for synthesis, it is necessary to have not just an understanding of how the acoustic signal must be made to vary, but also a formal, theoretically based, means of triggering the varying acoustic signal at the right moment. The theoretical underpinning is incomplete, which is one reason why speech synthesis systems are poor simulations of natural speech.]

THE FUTURE

Work continues on modifying the current Pragmatic Phonetic Model to describe emotive content. Data gathering from instances of utterances judged to have perceptible emotive content is continuing. The model continues to be computationally adequate, and is implemented on various speech synthesis systems to produce stimuli for testing. Work on rhythm generation, and on comparison with models of other types of emotive expression such as face-emotion, has begun.

The work is closely associated with the development of Cognitive Phonetics (M. Tatham). Tatham's work is about modelling the cognitive control of the neuromuscular system resulting in intended variation in the acoustic wave. The principles of Cognitive Phonetics are applicable to emotive modelling which assumes underlying biological and cognitive components and assumes intended and unintentional emotive effects in the utterance.

Palmyra Katherine Morton A.B. Ph.D.
email: katherine.morton@btconnect.com