Measuring emotion in the voice during psychotherapy interventions: A pilot study

The voice as a representation of the psychic world of patients in psychotherapeutic interventions has not been studied thoroughly. To explore speech prosody in relation to the emotional content of words, voices recorded during a semi-structured interview were analyzed. The subjects had been classified according to their childhood emotional experiences with caregivers and their different attachment representations. In this pilot study, voice quality, measured as spectral parameters extracted from the vowels of the key word “mother” (German: “Mutter”), was analyzed. The amplitude of the second harmonic was large relative to the amplitude of the third harmonic for the vowel “u” in the secure group as compared to the preoccupied group. Such differences might be related to the subjects’ emotional involvement during an interview eliciting reconstructed childhood memories.

Key terms: mother, attachment, vocal cues, emotion.

Correspondence concerning this article should be addressed to Dr. Maria Eugenia Moneta, Faculty of Medicine, University of Chile, Santiago de Chile. E-mail address: mmoneta@med.uchile.cl
Received: August 4, 2008. In revised form: January 13, 2009. Accepted: January 19, 2009.
MONETA ET AL. Biol Res 41, 2008, 389-395

INTRODUCTION
Spoken language is considered to be the main psychotherapeutic tool (Russel, 1993); however, the relevance of voice quality for the communicative process, i.e. the effects of emotion on speech intonation and tempo, has received little attention. Dittman and Wynne (1961) proposed that para-linguistic variables like pitch and speech rate may capture emotional signals from discourse, but that these could be independent of the subject's emotional expression in an interview. Studies on speech structure and emotion involve various components, spanning from the emotions underlying cognitive activity to the accompanying physiological responses during different emotions (Scherer, 1986; Gobl and Chasaide, 2003; Campbell, 2004).
Modern techniques for voice analysis are accurate enough to unveil shifts in the emotional involvement of speakers. Vocal patterns denote involuntary physiological changes in the speaker's speech production system (Scherer, 1986), as well as culturally accepted speaking styles (Scherer et al., 2001; Campbell and Erickson, 2004; Erickson, 2005). Physiological arousal and the appraisal of the emotion experienced exert a particularly strong influence on the configuration of vocal cues during discourse (Siegman and Boyle, 1993; Pell, 2001). A number of studies have shown that different emotions are cued by various combinations of acoustic parameters, with speech rate and the fundamental frequency presumably exerting the strongest effect (Banse and Scherer, 1996; Murray and Arnolt, 1993; Pell, 2001). The mean fundamental frequency and speech rate are generally higher for emotions associated with high sympathetic arousal, like anger and fear, as well as happiness and anxiousness (Banse and Scherer, 1996; Ellgring and Scherer, 1996; Johnston and Scherer, 1999; Scherer, 2003; Hachizaga et al., 2004; Erickson, 2005). Fundamental frequency and speech rate are the parameters that give a better assessment of the emotional state of speakers as evaluated by listeners (Banse & Scherer, 1996; Breitenstein et al., 2001).
An important but difficult-to-analyze aspect of expressive speech is voice quality, defined as the quality of "a sound by which a listener can tell that two sounds of the same loudness and pitch are dissimilar" (ANSI, Psycho-acoustical terminology, 1973; Campbell and Mokhtari, 2003). Changes in voice quality can signal both paralinguistic information in terms of changes in the speaker's emotional state, mood or attitude to the message and listener, and nonlinguistic information in terms of the speaker's social or geographical background, as well as personal characteristics related to the speaker's physical constitution or health (Mokhtari, 2003; Erickson, 2005). Changes in voice quality are the result of changes in the configuration of the vocal tract and of the laryngeal and glottal source (Mokhtari, 2003; Sakakibara and Imagawa, 2004). These changes can be quantified by comparing the amplitude levels of different spectral components, among which the first and second harmonic and the first formant have been used (Gordon and Ladefoged, 2001). In addition, spectral slope has been measured by splitting the spectrum into third-octave bands and fitting a line to their respective energies as an indication of harshness or softness (Schroeder, 2004).
Rice & Kerr (1986) developed a qualitative approach for studying vocal expression during client-therapist interaction and outcome in psychotherapy. However, no quantitative analysis of the structure of vocal patterns of either patient or therapist has yet been attempted. In contrast to the lack of studies on vocal correlates of emotion, facial expression during psychotherapy has received considerable attention in recent years (Bänninger-Huber, 1992; Krause et al., 1992; Krause, 1998; Dreher et al., 2001).
In this study, presented here as a preliminary report, we explored voice quality through vocal correlates of emotional styles evoked by the word "mother" (German: Mutter) in subjects selected for psychotherapy. A standardized interview, the Adult Attachment Interview (AAI; George et al., 1994), was used to classify the subjects into three categories according to their attachment representations (Secure, Dismissing and Preoccupied). Secure individuals give an open, coherent and consistent account of their childhood memories, regardless of whether these were positive or negative. These persons can easily address the topics asked about and convey an emotional balance about the relationships with their parents. Adults with the Dismissing classification give incoherent, incomplete accounts of the experience with their parents and often show gaps in memory. As a defense against painful memories, they minimize the importance of attachment experiences. These people insist on the normality of their affections and on their inner independence from their parents.
Preoccupied adults recall childhood experiences in an angry, excessive and nonobjective way. A characteristic of this group is the oscillation between positive and negative evaluations without being conscious of this contradiction. The language employed seems confused, unclear and vague. We predicted that the spectral structure of the word "Mutter" during recall of childhood memories differs among subjects of these three categories, affecting their voice quality.

Subjects
The study is based on the Adult Attachment Interview (AAI) conducted during the first therapeutic session with ten female subjects at the Psychosomatic Clinic of the University of Ulm. All subjects had had their first baby within the previous two months and were between 27 and 32 years old. They were instructed about the AAI, in which they are asked to remember personal emotional issues in a psychotherapeutic setting. Patients were asked about their willingness to collaborate in this study. All subjects were interviewed by the same therapist (A.B.) and were native German speakers. The AAI measures the current representations of past and present attachment experiences based on narrative accounts. The inter-individual differences in the assessed attachment representations form three main categories: "Secure", "Dismissing" and "Preoccupied" (Main & Goldwyn, 1994). Three Secure, four Dismissing and four Preoccupied subjects were analyzed with respect to their vocal spectra.

Procedure
For the present study, the word "mother" (German: Mutter) was selected from the audiotapes of AAIs of subjects of the three attachment groups to compare the affective quality of speech. Recordings of the interviews containing the word "Mutter" in response to specific questions were acquired at a sampling rate of 22050 Hz with a Macintosh computer (Power PC 7100), using the Sound Edit 16 software. The acquired sounds were further analyzed with the Signalyze 3.12 software. Twelve repetitions of the word "mother", uttered in response to the first half of the interview, which contains emotionally loaded questions from the AAI about the mothers, were analyzed for each subject. Power spectra (0-5500 Hz, 20 Hz resolution) of the two vowels of the word "mother" were obtained at the midpoint of the vowels u and e ("Mutter"). The amplitudes and frequencies of the following spectral peaks were measured: the fundamental frequency (Fo) and the second and third harmonics (H2 and H3, respectively). The differences in amplitudes and frequencies of these three spectral peaks were computed, and averages of these measurements were calculated for each individual and compared among individuals of the three groups with a one-way ANOVA and Duncan's test (P < 0.05). In addition to power spectra, oscillograms and sonograms of the complete word were obtained for graphic representation of this sound.
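The amplitude-difference measurement described above can be illustrated with a minimal sketch. It is not the Signalyze procedure used in the study: it assumes the fundamental frequency of the vowel frame is already known, evaluates a Hann-windowed discrete Fourier transform only at Fo, 2·Fo and 3·Fo, and is demonstrated on a synthetic three-harmonic signal. The function names (`level_db`, `harmonic_differences`) and the synthetic amplitudes are hypothetical; only the 22050 Hz sampling rate and the roughly 20 Hz spectral resolution come from the text.

```python
import cmath
import math

SAMPLE_RATE = 22050                   # Hz, as stated in the Procedure
FRAME_LENGTH = int(SAMPLE_RATE / 20)  # 1102 samples, i.e. about 20 Hz resolution


def level_db(frame, freq_hz, sample_rate=SAMPLE_RATE):
    """Level (dB, arbitrary reference) of `frame` at one frequency,
    from a Hann-windowed DFT evaluated at that frequency only."""
    n = len(frame)
    acc = 0j
    for i, x in enumerate(frame):
        w = 0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1))  # Hann window
        acc += w * x * cmath.exp(-2j * math.pi * freq_hz * i / sample_rate)
    return 20 * math.log10(abs(acc) + 1e-12)


def harmonic_differences(frame, f0_hz):
    """Amplitude differences Fo-H3 and H2-H3 in dB (f0_hz is assumed known)."""
    fo, h2, h3 = (level_db(frame, k * f0_hz) for k in (1, 2, 3))
    return {"Fo-H3": fo - h3, "H2-H3": h2 - h3}


# Synthetic vowel-like frame (NOT study data): harmonics at 200, 400, 600 Hz
# with amplitudes 1.0, 0.5 and 0.25.
frame = [
    sum(a * math.sin(2 * math.pi * k * 200.0 * i / SAMPLE_RATE)
        for k, a in ((1, 1.0), (2, 0.5), (3, 0.25)))
    for i in range(FRAME_LENGTH)
]
print(harmonic_differences(frame, 200.0))
```

For this synthetic frame the differences come out close to 12 dB (Fo-H3) and 6 dB (H2-H3), as expected from the amplitude ratios 4:1 and 2:1. With real recordings the harmonic peaks would first have to be located in the spectrum rather than assumed.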

RESULTS
Fig. 1 shows the oscillogram, sonogram and power spectra of the word "Mutter" from one individual of the secure group. Peaks corresponding to Fo, H2 and H3 from the power spectra were taken from the middle part of vowels u and e.
Power spectra showed that Fo was about 206.20 ± 12.27 Hz for vowel u and 205.92 ± 18.15 Hz for vowel e when all the subjects were pooled for analysis, and no significant differences occurred between groups (ANOVA: F 2 = 0.659, P = 0.719). However, the amplitude differences between Fo, H2 and H3 varied considerably among individuals and groups. Amplitude differences in dB between Fo and H3 (Fo-H3) and between H2 and H3 (H2-H3) were always positive for vowel u, i.e., Fo and H2 always had larger amplitudes relative to H3. A tendency to higher values occurred for some of these amplitude differences in the Secure group relative to the Preoccupied group, yielding significant differences in the amplitude of H2 and H3 for vowel u between both groups (H2-H3: ANOVA: F 2 = 6.727, P = 0.0346; Duncan's test, P = 0.045, Fig. 2). The Dismissing group showed intermediate values for these measurements and did not differ significantly from the values of the Preoccupied and Secure groups (Duncan's test, P = 0.640).
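The group comparison reported above rests on a one-way ANOVA of per-subject averages. The following is a minimal sketch of that computation, assuming three groups, so that the numerator has 2 degrees of freedom and the upper-tail probability of the F distribution has a simple closed form. The function names and the example values are hypothetical placeholders, not data from the study, and Duncan's post-hoc test is not covered.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups,
    each group being a list of per-subject average values."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def p_value_two_numerator_df(f_stat, df_within):
    """Upper-tail P of the F distribution; this closed form is valid
    ONLY when the numerator has 2 degrees of freedom (three groups)."""
    return (df_within / (df_within + 2.0 * f_stat)) ** (df_within / 2.0)


# Made-up H2-H3 averages in dB for three groups (placeholders, not study data).
secure = [5.0, 6.0, 7.0]
dismissing = [2.0, 3.0, 4.0]
preoccupied = [1.0, 2.0, 3.0]

f_stat, df_b, df_w = one_way_anova([secure, dismissing, preoccupied])
print(f_stat, df_b, df_w, p_value_two_numerator_df(f_stat, df_w))
```

With more or fewer than three groups the closed-form P value no longer applies, and a statistics library would be the safer choice.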

DISCUSSION
In this preliminary study on voice quality measurements related to a psychotherapeutic interview, we have found differences in the relative amplitudes of harmonics H2 and H3 between subjects of the Preoccupied and Secure groups for the vowel "u", but not for the vowel "e". The spectral variation in vowel "u" is probably related to its emphasis and longer duration in the phonetics of the German word "Mutter" (mother). Our data suggest subtle variations in the voice for the Preoccupied group relative to the Secure and Dismissing groups while pronouncing the word "Mutter". As mentioned in the Introduction, this could imply a difference in voice quality due to emotional arousal. Changes in voice quality are related to modifications of the vocal tract, including involuntary changes in tonicity; thus, we believe, speech production is a vehicle for emotion and mood.
Although preliminary, our data point to the usefulness of measuring the amplitude of harmonics in detecting affective attributes of speech. Measurements of the first harmonics could indicate more subtle changes in the voice than those detected by variations of the fundamental frequency and formants.
The lack of differences in Fo among subjects in our study could be related to a similar level of arousal for all individuals interviewed by the same therapist. Such invariance could reflect a convergence in fundamental frequency between speakers during conversation, as reported by Gregory et al. (2000). Although a number of studies have focused on pitch variables, in particular fundamental frequency (Scherer, 1986; Tolkmitt and Scherer, 1986; Murray and Arnolt, 1993), little is known about the role of voice quality in communicating affection. As pointed out by Scherer (1986) and later on by Gobl and Chasaide (2003), the tendency has been to concentrate on those parameters that are relatively easy to measure, such as fundamental frequency and timing variables. Johnston and Scherer (2001) have found that fundamental frequency varies with emotional state, being lower for states related to boredom and depression and higher for happy and anxious states. However, the study of voice quality, which requires a more detailed analysis of spectral content, has been restricted because of the methodological difficulties involved.
Voice quality signals information concerning the speaker's attitude towards the interlocutor, the subject matter and the situation. Furthermore, the listener's evaluations of emotions appeared to be primarily determined by voice quality cues other than fundamental frequency (Gobl, 1989; Lee and Childres, 1991; Gobl and Chasaide, 2003). Gobl and Chasaide (2003), in particular, have emphasized that voice modifications are related more to attitudes, states and mood than to specific emotions. A human's ability to "listen between the lines" is heavily dependent on voice quality (Campbell, 2004). In terms of subjective perception of the word "mother" (Mutter), non-instructed listeners could not perceive any audible differences between the subjects in our study.
Changes in the spectral structure of speech that may denote emotional differences have been reported by measuring formants at different stages of therapeutic interventions (Tolkmitt & Scherer, 1986). Fujimura and Erickson (2004) have proposed a model that incorporates speech rhythm aspects, as well as the linguistic and sociolinguistic aspects of expressive speech during conversation. Nevertheless, a comprehensive model of expressive speech is still lacking.
It has been established that the infant brain, as early as seven months after birth, detects emotionally loaded words and shows differential attentional responses, depending on their emotional valence (Grossmann et al., 2005). Infants are able to interpret prosody and recognize and imitate vocal patterns and rhythms, discriminating intonational aspects of the human voice (Fernald, 1993; Spence and Freeman, 1996; Floccia et al., 2000). They can also respond to variations in frequency, intensity and temporal patterning of sounds (implicit knowledge) signaling affective states (Beebe et al., 1997; Kuhl et al., 1997; Papousek and Papousek, 1981). Such an early disposition to react to the emotional contents of speech in an implicit way suggests the need for further research on the influence of prosody on interpersonal encounters in the therapeutic context. As Beebe points out, words are not enough; music is needed (Beebe, 2007).

responsiveness to vocal affect in familiar and unfamiliar languages. Child Dev, 64: 657-674
FLOCCIA C, NAZZI T, BERTONCINI J (2000).

Figure 1: Oscillogram, sonogram and power spectra of the word "Mother" (German: Mutter) from one individual of the secure (S) group. Peaks corresponding to the fundamental frequency (Fo), second (H2) and third (H3) harmonics are shown in the power spectra of the vowels u and e.

Figure 2: Differences between the amplitudes of the spectral peaks of the fundamental frequency (Fo) and second (H2) and third (H3) harmonics for the vowels u and e for the three attachment groups. Each circle indicates the average value for an individual. Significant differences (Duncan test, P < 0.05) occurred only for vowel u (amplitude H2-amplitude H3) between the preoccupied (P) and secure (S) groups.

Strategies of Discovery. New York: Plenum Press.
SAKAKIBARA K, IMAGAWA H (2004) Acoustical interpretation of certain laryngeal settings using a physical model. Proc Speech Prosody, Nara: 637-640
SCHERER K, JOHNSTON T, BANZINGER T (1998) Automatic verification of emotionally stressed speakers: The problem of individual differences. Proc Intern Workshop on Speech and computer, St. Petersburg.
SCHERER K (1986) Voice, stress and emotion. In M. H. Appley & Trumbull (Eds.) Dynamic of stress. New York: Plenum Press pp: 159-181.
SCHERER K, BANSE R, WALLBOTT HG (2001) Emotion inferences from vocal expression correlate across languages and cultures. J Cross-cultural Psychol, 32, 1: 76-92
SCHERER K (2003) Vocal communication of emotion: A review of research paradigms. Speech Commun, 40: 227-256
SCHROEDER M (2004) Speech and Emotion research: an overview of research framework and a dimensional approach to emotional speech synthesis. Doctoral Thesis, Phonus 7, Res Inst Phonetics, Saarland University
SIEGMAN A W, BOYLE S (1993) Voices of fear and anxiety and sadness and depression: The effects of speech rate and loudness on fear and anxiety and sadness and depression. J Abnorm Psychol, 10: 430-437
SPENCE M J, FREEMAN MS (1996) Newborn infants prefer the maternal low-pass filtered voice, but not the maternal whispered voice. Infant Behav Dev, 18, 15: 727-735
TOLKMITT P, SCHERER K (1986) Effects of experimentally induced stress on vocal patterns. J Experim Psychol: Human Perception and Performance, 12: 302-313