Katherine Morton and Mark Tatham
There is currently a great deal of interest in developing computer speech for human-computer interaction. Voice output forms part of information access systems and is used in exchanging information among groups using computer networks.
However, in order to be useful and acceptable, computer voice output systems must be of high quality and interface easily with language understanding and voice recognition systems. Irrespective of how good the automatic speech recognition, the dialogue system design, and the information content may be, the success of the entire interaction system will be judged by how easy communication is between the human user and the machine. Poor voice output will be judged as a poor response by the entire system, and will reduce users' willingness to take up a voice option in a human-computer interaction context.
What would a fully developed computer-based conversation system minimally need to do?
However, there are currently no speech systems which can fully achieve these goals. A number of problems can be identified which are impeding the full development of conversational spoken language systems. In addition to cost, there are problems of menu structure, of identifying faster and more direct approaches to system design, and of meeting technology requirements such as real-time operation, repair, indexing, and adequate language models, including voice output suitable for dialogue exchanges that convey more complex ideas than simple factual information.
BENEFITS IN USING SPOKEN LANGUAGE
All these functions need to be incorporated into HCI systems. For example, in the most basic case, computer speech provides information in response to inquiries. In teaching or giving instructions, explanations are essential. In monitoring the progress of a user, as in learning, commentary provides feedback. The final feature, often referred to as 'tone of voice', can be crucial: a firm, friendly tone of voice is more acceptable to the user than the mechanical effect that so much computer speech currently produces.
Many speech synthesis systems provide intelligible output and are now on sale in the marketplace. However, their use is limited because they do not appear to pass a threshold for naturalness. Naturalness incorporates features such as a good intonation pattern, clear rendering of segments, and adequate timing between segments, words, phrases and sentences - especially in paragraph-length explanations. It is becoming increasingly important to incorporate responsive changes in the voice output corresponding to variations in the dialogue between the computer and the user.
For example, if a human user appears confused, a synthetic voice repeating the same message 'this is not correct, please try again' can become irritating after a few repetitions. However, if the syntax, the choice of words, and the tone of voice can convey patience and firmness, and can give the impression of sympathetic understanding of the plight of the confused user, the user will be more inclined to persist with trying again and again. The appropriate rendering depends on an adequate specification of high level synthesis; without it, the user may well give up. The changes involved lie within segments, words, phrases, and sentences - all specified at the higher synthesis level.
HOW IS SYNTHETIC SPEECH PRODUCED?
In generating synthetic speech, it is necessary to assign physical values to the acoustic features which drive the synthesizer. To produce the stress, intonation, and timing effects that characterize human speech, it is essential to assign the appropriate physical values. The initial assignment is made in high level synthesis: phonological units, phrases, and entire sentences are given abstract values which are rendered at the low level by a set of mapping rules.
The prosodic subcomponent of phonology describes a generalized intonation pattern within the sentence. For synthetic speech, it is necessary to assign either specific fundamental frequency values, or a range of values, corresponding to the abstract pattern specified by phonology in order to produce intelligible, although flat sounding, neutral speech.
Thus there are two levels within the high level synthesis component: the assignment of abstract labels, and the specification of the rules which link the abstract labels with the actual physical values - the numbers - which drive the low level synthesis system.
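As a rough sketch of these two levels (the contour labels, base frequency, and step sizes below are hypothetical illustrative choices, not values from any particular system), an abstract contour label assigned at the first level might be mapped onto per-syllable fundamental frequency targets at the second:

    # Sketch of the two levels of high level synthesis. All labels and
    # numeric values are hypothetical illustrations, not real system values.

    ABSTRACT_CONTOURS = {
        "statement": "fall",   # level 1: abstract phonological labels
        "question": "rise",
    }

    def mapping_rule(pattern, n_syllables, base_hz=120):
        """Level 2: map an abstract contour label onto per-syllable
        fundamental frequency targets (Hz) for the low level synthesizer."""
        shape = ABSTRACT_CONTOURS[pattern]
        step_hz = 10 if shape == "rise" else -10
        return [base_hz + i * step_hz for i in range(n_syllables)]

    # A 'question' over four syllables yields rising F0 targets.
    print(mapping_rule("question", 4))   # [120, 130, 140, 150]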
Consider the sentence Who is speaking?, to be realized as the utterance Who's speaking?
An abstract general pattern of 'question' is assigned to the sentence. The main feature of this pattern is that the intonation rises, giving the perception of a higher tone at the end of the utterance when it is finally produced.
An abstract assignment to delete the vowel in is is implemented, taking into account the observation that speakers rarely produce the entire word is in this linguistic context but elide the vowel, giving 's.
An abstract assignment of reduced stress is placed on the syllable ing within the word speaking because the rules of English pronunciation require unequal stress in this word.
Mapping rules relate these abstract assignments to real physical values. The altered values are directed to the low level synthesizer, which resets each time it receives a new set of information resulting from the application of the mapping rules in the high level synthesis software.
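To make this concrete, here is a minimal sketch of what such mapping rules could look like for the example above; the syllable inventory, durations, and frequency targets are all assumed for illustration and are not the rules of any actual system:

    # Hypothetical mapping rules for "Who is speaking?" -> "Who's speaking?".
    # Segment labels and numeric targets are illustrative assumptions.

    syllables = [
        {"syl": "who",   "stress": "primary"},
        {"syl": "is",    "stress": "reduced"},   # vowel to be elided -> 's
        {"syl": "speak", "stress": "primary"},
        {"syl": "ing",   "stress": "reduced"},   # English requires unequal stress here
    ]

    def apply_mapping_rules(syllables, base_hz=110, rise_hz=15):
        out = []
        for i, s in enumerate(syllables):
            # Elision rule: delete the vowel of 'is', attaching 's to the
            # preceding word, as speakers do in this linguistic context.
            if s["syl"] == "is":
                out[-1]["syl"] += "'s"
                continue
            # Stress rule: reduced stress maps to a shorter duration.
            dur_ms = 230 if s["stress"] == "primary" else 130
            # Question rule: F0 rises toward the end of the utterance.
            f0_hz = base_hz + i * rise_hz
            out.append({"syl": s["syl"], "f0_hz": f0_hz, "duration_ms": dur_ms})
        return out

    for params in apply_mapping_rules(syllables):
        print(params)   # numbers of this kind drive the low level synthesizer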
Without a well-specified high level system, the low level could not approach intelligibility, much less sound natural.
The high level system can also incorporate a set of rules which modify the assignment of fundamental frequency changes and syllable durations to create emotive effects. These rules are described as departures from default contours, which are characterized by sets of rules within the procedures of the synthesis system. They operate by changing values within the range of values associated with the realization of the abstract assignments which produce a flat utterance. The effect is to produce more natural sounding speech - speech that conveys emotion and attitude.
These rules are activated when markers for emotive content are generated at a higher level in the language understanding section of the conversation/discourse model; this component is sensitive to the changing requirements of the dialogue as it unfolds.
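A sketch of how such marker-driven rules might operate (the marker names and scaling factors are purely hypothetical): the dialogue component emits an emotive marker, which selects a rule that shifts the fundamental frequency and duration targets within their permitted ranges, departing from the default flat contour:

    # Hypothetical emotive modification rules: departures from the default
    # contour, expressed as scalings of F0 and duration within set ranges.

    EMOTIVE_RULES = {
        "neutral":  {"f0_scale": 1.00, "dur_scale": 1.00},
        "sympathy": {"f0_scale": 0.95, "dur_scale": 1.15},  # lower, slower
        "urgency":  {"f0_scale": 1.15, "dur_scale": 0.85},  # higher, faster
    }

    def apply_emotive_marker(default_params, marker):
        """Modify default F0/duration targets according to a marker
        generated by the language understanding / dialogue component."""
        rule = EMOTIVE_RULES[marker]
        return [
            {
                "syl": p["syl"],
                "f0_hz": round(p["f0_hz"] * rule["f0_scale"]),
                "duration_ms": round(p["duration_ms"] * rule["dur_scale"]),
            }
            for p in default_params
        ]

    # Default (flat) targets for "try again"; the marker perturbs them.
    default = [
        {"syl": "try",  "f0_hz": 120, "duration_ms": 200},
        {"syl": "a",    "f0_hz": 120, "duration_ms": 90},
        {"syl": "gain", "f0_hz": 120, "duration_ms": 240},
    ]
    print(apply_emotive_marker(default, "sympathy"))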
In human speech, changes to the fundamental frequency (which minimally cues the perception of intonation), to syllable duration, and to amplitude provide information to the listener about attitude and emotion. For technical reasons, however, changes to the acoustic specification in speech synthesis are usually limited to fundamental frequency and duration; synthetic emotive effects are therefore restricted to manipulating these two parameters.
HCI systems are used to deliver information that is important to their users, who therefore expect a high level of naturalness in synthetic voice output. In a sustained dialogue of more than a few exchanges, or where a problem or confusion arises, the user will be more likely to persist if the voice output does not have distracting acoustic features such as poor voice quality, inadequate timing, repetitive intonation, and poor segmental rendering. A paradox has been created: as synthetic speech systems have become better, users have become more discriminating. Small differences that were once ignored are now focused on, and judgments are made about the acceptability of the entire dialogue system, of which the synthesizer is only a part.