      
      
      Katherine Morton - RESEARCH PROFILE   
      
        
        [key words: 
      expression, 'emotive content of spoken language',
        linguistics, biopsychology, psychoacoustics, 
      'acoustic phonetics'] 
       
      RESEARCH INTEREST 
      
       
      What are we saying when we say 'Hello'? These two syllables, four separate sounds, are a greeting. They are an 
      acknowledgement of meeting someone, of letting the other person know the 
      speaker is open to the beginnings of a conversation. 
       
      But this word also can be spoken in a number of different ways; the 
      emotive content can inform the listener that the speaker is happy, bored, 
      in love, irritated, perplexed, in a hurry, frightened. All this 
      information can be conveyed by adding other words to 'hello' such as 
      'hello, I'm happy to see you again', but most speakers convey their 
      feeling and attitude by subtle alterations in the manner of speaking, 
      sometimes called tone-of-voice. 
       
      Thus, spoken language contains other information than the plain message 
      expressed by the words, syllables, and segments. My current work addresses 
      the question: What is this other information and how might it be described 
      and modeled? 
      BASIC INTEREST 
      My basic interest is in the relationship between cognitive processing and the 
      physical realization of the result of that processing - the translation of 
      cognition into action. The specific research has focused on modeling one 
      aspect of human language behavior - what we do when we produce and 
      perceive spoken language. Both speaking and understanding the message are 
      mediated by biological and cognitive systems.  
      That is: when speakers think of something to say, they draw on their 
      knowledge of the language and configure the vocal tract to produce the 
      appropriate sounds; listeners detect these sounds, interpret them 
      according to the rules of the language, and assign a meaning. Ideally, the 
      resulting concept for the listener is similar to the speaker's intention. 
      This process involves cognitive and physiological activity on the part of 
      both speaker and listener. These activities are associated in a coherent 
      way, otherwise we would not be able to communicate.  
      RESEARCH HISTORY - a brief statement 
      
        - EARLY work consisted of modeling what we know about some phonological 
      aspects of language correlated with the physical expression of that 
      knowledge (i.e. moving the articulators). The work is summarized below.
 
       
      
        This work was extended to looking at the resulting acoustic wave shape; 
      the aim was to relate some aspects of the cognitive intention to speak, of 
      motor control realizing the intention, and of the resulting acoustic 
      features.  
       
      
        - This led to CURRENT work which is concerned with modelling the 
      communication of the emotion accompanying the plain linguistic message. 
      Both the linguistic message and the emotional coloring are encoded in the 
      speech waveform. Listeners detect and interpret the plain linguistic 
      message according to their knowledge of the language, whilst also being 
      aware of emotional overtones. This work is summarized below.
 
       
      SUMMARY OF RESEARCH - a fuller statement 
      
        - EARLY WORK: Models of spoken language production
 
       
      The work was based on proposing a relationship between 

          - linguistic knowledge which describes what speakers know about 
            their language in order to produce an utterance and 
          - descriptions of the properties of the motor control system and 
            configuration of the vocal tract to produce the intended spoken 
            language.
 
         
        Experimental design centered on the question type: how can linguistic 
      knowledge be implemented within the constraints imposed by the biological 
      system? To this end, experimental work was conducted using EMG, 
      aerodynamic, and other techniques. The work led to development of a model 
      of motor control directed toward describing gestures associated with the 
      linguistic intention of producing specific repeatable and unambiguous 
      segments and syllables. 
       
      Research effort moved from looking at physiological variables to looking 
      at the resulting acoustic variables in the speech waveform. The model was 
      tested using early speech synthesis. Later, neural networks were used to 
      simulate production and perception. This led into development of the PALM 
      project (Psycho-Acoustic Language Modeling) which underlies the current 
      work. 
       
      
        - CURRENT WORK: Models of emotive spoken language
 
       
      In spoken language, the basic message with the necessary linguistic 
      information content is conveyed by the words and their arrangement within 
      the sentence. However, speakers and listeners also have feelings about one 
      another, and speakers have attitudes about what they are saying and 
      toward the listener. 
       
      We are all aware that humans speak with emotive content, with expression 
      of some amount of feeling. Feelings expressed in speech may be mild, or so 
      strong they elicit strong feelings from the listener. That is, remove 
      emotive content, and what's left is the plain message with none of the 
      feelings which surround all speech. Some researchers claim and some 
      speakers report 'neutral' speech. But others maintain that neutral does in 
      fact convey an impression - perhaps minimal emotive content, perhaps an 
      awareness of disinterest or calm, but not zero emotive content. 
       
      And as a result of awareness of emotive content, we often make judgments 
      about others not by WHAT is said but HOW it is said: by the tone-of-voice. 
       
      The research objective is to model those aspects of spoken language which 
      carry emotive content. The first stage is to determine some of the 
      essential prosodic information that cues perception in the listener of the 
      emotional overtones and attitudes conveyed by the speaker and correlate 
      this information with the acoustic wave and with reports from listeners. 
       
      Building on the work in PALM, I am looking at variations in the acoustic 
      wave which appear to produce emotive perceptual effects in the listener. 
      The broad question is: how do we communicate on non-linguistic levels 
      within the constraints imposed by the biological and cognitive system? I 
      call the approach 'Pragmatic Phonetics'. 
       
      For example, given the utterance: 'hello', 
        how many ways are there of saying this? 
       
      
        This simple utterance can sound happy, sad, irritated, sardonic, fearful, 
      confident and so forth. The emotive effects are achieved by manipulating 
      the vocal tract to produce an acoustic waveform that results in perceived 
      intonation, rhythm, and loudness of the syllables, words, and phrases. 
      Listeners also perceive subtle interactions of all three parameters.  
       
      The standard model of emotive spoken language divides the utterance into 
      two parts: 

          - a description of the plain message described by linguistics 
            (minimally by semantics, phonology, syntax) and 
          - a description of emotive content in terms of varying intonation, 
            increased or decreased energy on stress patterns, and increased or 
            decreased loudness of the utterance. (The acoustic correlates are: 
            fundamental frequency variation, amplitude changes, timing changes, 
            timing variation.)
 
         
        My approach is not to work within these two divisions but to consider 
      the plain message as wrapped in an emotive environment. The speaker's 
      physiological setting produces changes in the vocal tract which give rise 
      to variations in the acoustic waveform. These changes are detected and 
      interpreted by the listener. Emotive parameters are variables, each with 
      an associated range of numerical values. 
       
      Thus emotion is not added to the basic message but occurs across the 
      entire utterance. This approach will allow for changes in the numerical 
      values of the emotive variables as the utterance progresses and will deal 
      with changing emotive effects within a dialog. I call this Pragmatic 
      Phonetics. 
       
      MORE DETAIL ON SPOKEN LANGUAGE MODELLING 
      
        - What does language do?
 
       
      It is generally agreed there are four functions of spoken language: 
       
      
        
          - to convey information
          - to provide explanation
          - to provide commentary
          - to express attitude and emotion.
          
 
         
        The first three functions convey the plain message which is communicated 
      through written language as well as spoken language. The fourth conveys 
      the attitude and feeling state of the speaker, mainly through 
      speaking. The standard model claims emotive content is overlaid on the 
      plain message, which itself is regarded as having non-emotive content, 
      often labelled 'neutral' expression. 
       
      
        - How do we model expressive speech?

          - The Standard Model
 
       
      The problem is: how can we model emotive content associated with an 
      abstract derived sentence? How can the output of the phonological 
      processing component be modified? At what point could the associated 
      emotive content be accounted for in spoken language?  
       
      Initially, my work was within a standard framework derived from the 
      multilayered traditional generative grammar approach. In this model, the 
      phonological component specifies the overall abstract sound pattern of the 
      word or sentence within each language. The speaker puts together an 
      abstract idealized prosodic and segmental specification of the intended 
      utterance - called the utterance plan. Emotive content is not part of the 
      linguistic model. 
       
      To construct an account of emotive effects, the standard approach is to 
      set up two input channels: one, based on the linguistic description, 
      conveying text input (the plain message) output from the phonology, and a 
      second channel consisting of a sequence of emotive markers synchronized 
      with the text. Emotive effects are added to the plain message. The aim of 
      my work, in common with that of others, was to ask how observed emotive 
      speech effects might be modeled as an extension to the non-emotive prosodic 
      effects of intonation and rhythm found in phonological descriptions. 
       
      With some differences in method and the choice of basic units, my early 
      work in Pragmatic Phonetics followed the principles of the standard model: 
      modifications to the assignment of fundamental frequency changes and 
      duration of syllables are systematically overlaid on the 'neutral' 
      representation. Emotive effects are described as rule-governed departures 
      from neutral contours which are the default values produced by an 
      idealized neutral production system. Pragmatic rules were activated when 
      pragmatic expression markers were generated at a higher level in the 
      language understanding section of the conversation/discourse model. 
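
      As a rough illustration only, the overlay idea of the standard model can 
      be sketched in a few lines of Python. This is not the implementation used 
      in the work described here: the Syllable class, the EMOTION_RULES table, 
      the scaling factors, and the frequency and duration values are all 
      hypothetical, chosen just to show neutral default values being modified 
      by rule according to a marker stream synchronized with the text. 

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    text: str
    f0_hz: float        # neutral fundamental frequency (hypothetical value)
    duration_ms: float  # neutral duration (hypothetical value)


# Hypothetical overlay rules: multiplicative departures from the neutral contour.
EMOTION_RULES = {
    "neutral": {"f0_scale": 1.0, "duration_scale": 1.0},
    "happy": {"f0_scale": 1.2, "duration_scale": 0.9},
    "bored": {"f0_scale": 0.9, "duration_scale": 1.25},
}


def apply_overlay(plan, markers):
    """Apply one emotive marker per syllable to the neutral utterance plan."""
    if len(plan) != len(markers):
        raise ValueError("markers must be synchronized with the syllables")
    result = []
    for syllable, marker in zip(plan, markers):
        rule = EMOTION_RULES[marker]
        result.append(
            Syllable(
                syllable.text,
                syllable.f0_hz * rule["f0_scale"],
                syllable.duration_ms * rule["duration_scale"],
            )
        )
    return result


# Channel 1: the plain message as a neutral plan. Channel 2: the emotive markers.
neutral_hello = [Syllable("he", 120.0, 150.0), Syllable("llo", 110.0, 250.0)]
happy_hello = apply_overlay(neutral_hello, ["happy", "happy"])
```

      In this sketch the marker channel carries one marker per syllable; the 
      neutral plan itself is left unchanged and the emotive version is derived 
      from it, which is the defining property of the overlay approach. 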
       
          Data 
      The data was derived from human speech, in which variations in the 
      acoustic parameters fundamental frequency, syllable timing and amplitude 
      are assumed to provide information to the listener about attitude and 
      emotion. These changing values become incorporated in the acoustic 
      realization of the phonological output of the linguistic sentence to 
      produce utterances intended to trigger the perceptual effect of emotion 
      and attitude. As in the work of many researchers, the model was stated 
      in computational terms so that its coherence could be tested using a 
      speech synthesizer. 
       
          Testing the model 
          Many researchers test models using a speech synthesizer to generate 
          sample utterances containing emotive content generated by rule. 
          Synthetic speech without emotive content is described as flat or 
          mechanical and some researchers label this output 'neutral'. To 
          produce a waveform which simulates speech, specific fundamental 
          frequency and timing values are assigned, correlated with the abstract 
          pattern specified by phonology. The fundamental frequency values, 
          which the listener perceives as intonation patterns, and timing of 
          words and syllables are manipulated by rule to produce a perceived 
          effect of emotive content. (Some systems also allow manipulation of 
          amplitude.) 
         
        
          - PRAGMATIC PHONETICS - a departure from the Standard Model
 
       
      It is essential to distinguish between what the speaker does, and how it 
      is modeled. Models by definition are abstractions; a model is a set of 
      generalizations that put forward hypotheses about the phenomenon being 
      researched. In this case, the speaker/listener and the flow of information 
      about emotion/attitude are being modeled.  
       
      In the research literature on emotive content of utterances, the word 
      'neutral' has been used two ways: as a term in a model and as a 
      description of what a speaker does. As a term within the model, 'neutral' 
      describes an abstract non-emotive content from which other emotions have 
      been mapped. In the case of speaking style, 'neutral' is used to 
      characterize one of the results of what the speaker does - produce 
      non-emotive speech - and is equivalent in status to other emotive styles 
      such as 'happy'. Some researchers also map from 'neutral', meaning no 
      emotion, to the other emotions on the same level without reference to an 
      underlying abstract concept 'neutral'. This can be confusing. 
       
      In the approach I suggest, the pragmatic modelling of expression, the word 
      neutral is used in the second way, to characterize emotive content in the 
      utterance. Neutral is seen as an expression on the same level as other 
      expressions such as happy, sad, etc. Since the emotion expressions have 
      equivalent status, emotive content is not mapped from neutral to the other 
      emotion expressions. The content of the phonological plan is seen as 
      placed within the speaker's overall emotive stance on each occasion. There 
      is no mapping between or within descriptive levels. Emotive content is 
      described as varying within sliding scales on labelled sections of the 
      speech waveform. 
       
      The Pragmatic model requires syllables or words to be marked with respect 
      to a range of possible prosodic values and markers assigned to larger 
      units such as phrases. Emotion features select within the range of 
      acceptable possibilities and stretch or compress these ranges. The 
      abstract values within the range of possibilities are correlated with 
      real-world acoustic values.  
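
      One possible reading of this range-based marking can be sketched as 
      follows. It is an illustrative assumption, not the model's actual 
      formalism: the ProsodicRange class, the 'stretch' and 'position' 
      parameters, and the numerical values are hypothetical. The sketch shows 
      a range being stretched or compressed about its midpoint and a value 
      then being selected on a sliding scale within it, with no neutral 
      contour acting as a starting point. 

```python
from dataclasses import dataclass


@dataclass
class ProsodicRange:
    low: float   # lowest acceptable value, e.g. fundamental frequency in Hz
    high: float  # highest acceptable value

    def scaled(self, factor):
        """Stretch (factor > 1) or compress (factor < 1) the range about its midpoint."""
        mid = (self.low + self.high) / 2
        half_width = (self.high - self.low) / 2 * factor
        return ProsodicRange(mid - half_width, mid + half_width)

    def select(self, position):
        """Pick a value on a sliding scale: 0.0 is the bottom of the range, 1.0 the top."""
        if not 0.0 <= position <= 1.0:
            raise ValueError("position must lie between 0 and 1")
        return self.low + position * (self.high - self.low)


def realize_f0(f0_range, stretch, position):
    """Emotion features first reshape the range, then select a value within it."""
    return f0_range.scaled(stretch).select(position)


# Hypothetical marking for one syllable of 'hello': 100 to 140 Hz.
hello_range = ProsodicRange(100.0, 140.0)
wide_and_high = realize_f0(hello_range, stretch=1.5, position=0.75)
narrow_and_mid = realize_f0(hello_range, stretch=0.5, position=0.5)
```

      Every expression, including one a listener might label 'neutral', would 
      simply correspond to its own stretch and position settings in such a 
      scheme. 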
       
      In this model, all spoken language is assumed to be affected by the 
      physiological setting of the speaker. What can be regarded as 
      non-emotive speech, triggering a perception of little or no emotion, is 
      produced by a speaker whose physiological settings are at a low level of 
      excitation or a high level of inhibition. Angry speaking results from 
      variable tension in the physiological system - the tension results in 
      motor control patterns that vary from the settings for, say, happy 
      speaking. Vocal tract configuration varies as the speaker's muscular 
      tension varies. 
       
      Changes in vocal tract innervation and shape 
          result in changes which appear in the speech waveform: the listener is 
          able to detect these changes and usually interprets them adequately. Listeners sometimes 
      get it wrong - then the listener may ask the speaker: 'Do you intend to 
      sound annoyed?' or the listener may ask for corroboration 'I can tell 
      you've had good news', or be uncertain 'I don't think you meant to say 
      that, did you?', etc. 
       
      In summary, the system works like this: the speaker knows the sound 
      pattern of the language; this is described by linguistics (the 
      phonological component). In the speaker, the phonological plan for the 
      utterance is assembled at a cognitive level. The appropriate prosodic 
      effects are assigned. It is possible to also intend to express emotion in 
      a particular way - to add, to subtract, to repress emotive content, 
      although generally in conversation, we don't consciously plan to speak in 
      a particular way. Well-known exceptions are newsreaders, lecturers, and 
      speakers suppressing irritation in an argument. Whether planned or as a result of 
      the physiological setting of the speaker, emotive effects appear in the 
      waveform. The listener can detect many of the intended changes, as well as 
      the unintended ones.  
       
      The task for the researcher is to describe these changes. Furthermore, 
      speakers produce utterances with ongoing variability. That is, 
      emotion-dominated changes can occur within an utterance. The speaker can begin a 
      sentence calmly and end it laughing, crying, or raging. This observed 
      ongoing variability cannot be easily modeled in the Standard Model where 
      emotive content is simply added to the plain message. 
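
      A minimal sketch of such ongoing variability, under stated assumptions, 
      is given here. The function name, the idea of a single emotive variable 
      on a 0 to 1 scale, and the use of linear interpolation are all 
      hypothetical simplifications; the point is only that the value can 
      differ from syllable to syllable rather than being fixed for the whole 
      utterance. 

```python
def emotive_trajectory(start, end, n_syllables):
    """Linearly interpolate one emotive variable across the syllables of an utterance."""
    if n_syllables < 1:
        raise ValueError("an utterance needs at least one syllable")
    if n_syllables == 1:
        return [start]
    step = (end - start) / (n_syllables - 1)
    return [start + i * step for i in range(n_syllables)]


# Hypothetical five-syllable sentence that begins calmly and ends in a rage.
calm_to_raging = emotive_trajectory(start=0.2, end=1.0, n_syllables=5)
```

      Each value in the resulting list could then drive the prosodic settings 
      of the corresponding syllable. 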
       
          Testing the model 
          The results of the research are described by a computational model, and 
      tested using speech synthesis. Synthesis allows testing the coherence of 
      the model and provides a stimulus set for testing listeners' responses in 
      an experimental setting. The Pragmatic Phonetic utterance model is 
      evaluated by correlating the acoustic changes of fundamental frequency and 
      duration with perceived differences reported by panels of listeners. 
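
      The kind of correlation involved can be illustrated with a short sketch. 
      The choice of Pearson's coefficient is an assumption (the particular 
      statistic is not specified here), and the numbers below are invented 
      placeholders, not experimental results. 

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented placeholder values: change in fundamental frequency (Hz) applied to
# five synthetic stimuli, and a mean listener rating of perceived emotion for each.
f0_change_hz = [0.0, 10.0, 20.0, 30.0, 40.0]
mean_listener_rating = [1.0, 2.1, 2.9, 4.2, 4.8]
r = pearson_r(f0_change_hz, mean_listener_rating)
```

      The same function could be applied to duration changes; a coefficient 
      near 1 or -1 would suggest that the manipulated acoustic parameter 
      tracks what listeners report. 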
         
        [I should 
        emphasize that synthesis is used to test the model; this work is not 
        designed as an engineering implementation to improve synthetic speech. 
        To include a pragmatics module in a computationally adequate model for 
        synthesis, it is necessary to have not just an understanding of how the 
        acoustic signal must be made to vary, but also a formal, theoretically 
        based, means of triggering the varying acoustic signal at the right 
        moment. The theoretical underpinning is incomplete, which is
        one reason why speech synthesis systems are 
        poor simulations of natural speech.] 
       
      THE FUTURE 
      Work continues on modifying the current Pragmatic Phonetic Model to 
      describe emotive content. Data gathering from instances of utterances 
      judged to have perceptible emotive content is continuing. The model 
      continues to be computationally adequate, and is implemented on various 
      speech synthesis systems to produce stimuli for testing. Work on rhythm 
      generation, and on comparison with models of other types of emotive 
      expression such as face-emotion, has begun. 
       
      The work is closely associated with the development of Cognitive Phonetics 
      (M. Tatham). Tatham's work is about modelling the cognitive control of 
      the neuromuscular system resulting in intended variation in the acoustic 
      wave. The principles of Cognitive Phonetics are applicable to emotive 
      modelling which assumes underlying biological and cognitive components and 
      assumes intended and unintentional emotive effects in the utterance.  
       
       
      Palmyra Katherine Morton A.B. Ph.D. 
      email: 
      katherine.morton@btconnect.com 