INTRODUCTION
Correct pronunciation is an essential aspect of second language (L2) acquisition, indispensable not only for speech production but also for adequate listening comprehension, because the articulatory and auditory systems are interrelated: a learner can hardly recognize a sound s/he has never produced because it is absent in the first language, or L1 (Levis, 2005). At the same time, lightly accented speech and reliable listening comprehension are among the requirements for some jobs, for instance, call center operators, so it is not rare for a learner to need more effective pronunciation training (Hunter & Hachimi, 2012; Lockwood, 2012).
Traditional language courses commonly teach pronunciation and auditory recognition of L2 phonemes using four basic steps: (1) presentation/explanation, (2) imitation, (3) adjustment, and (4) recognition (Celce-Murcia, Brinton & Goodwin, 2010). First, the instructor describes what position the articulatory organs must take and how they must move in order to produce the target sound or sound combination; second, the learner listens to words with the target sound and repeats them; third, the teacher provides feedback, identifying, explaining, and correcting errors with relevant exercises until production of the target sound is appropriate for the orientation of the course and the learner's level; fourth and finally, the learner listens to input and discriminates between words with and without the target sound.
At step 3 (adjustment), special attention is paid to correcting the student's errors. In their first articulatory attempts, learners almost always mispronounce the target sound, especially if the phoneme they are practicing is not present in L1. In fact, committing and correcting errors is a normal part of the language learning process. Therefore, it is important for a human teacher or an intelligent tutor model to provide relevant feedback by identifying errors in the learner's speech, explaining their causes, and offering adequate corrective exercises. This task can only be accomplished by taking into account many linguistic, psychological, and pedagogical aspects. We believe that the primary linguistic aspect is knowledge of the similarities and differences between the L1 and L2 pronunciation systems. This knowledge helps to detect the learner's mispronunciations and to develop adequate correction techniques, as well as to design teaching methods that anticipate and prevent possible errors.
Therefore, the first objective of our work is to detect the similarities and differences between the phonetic systems of two languages, namely American English (AE) and Mexican Spanish (MS); we restrict ourselves to consonants due to the space limitations of a journal article. To achieve this, we perform a theoretical comparative analysis of the consonants of the two languages at the level of both phonemes and allophones. Since allophones vary across variants of a language, we have chosen the above-mentioned variants of English and Spanish. To the best of our knowledge, no such analysis has been reported in previous work. Our comparison is based on the study of the literature on English and Spanish phonology and phonetics published to date. Secondly, as an example of an application, we consider Computer Assisted Pronunciation Training (CAPT) for teaching American English pronunciation to Mexican Spanish speakers, and in particular the error detection component of the CAPT model. The results of our analysis are applied to defining hypothetical error patterns which can be used as a starting point for diagnosing possible mispronunciations, with subsequent verification and adjustment taking into account the principles of phonotactics (Park, 2013) and empirical phonetic analysis of English learners' speech (Strange, 2011). We also believe that the similarities found between the two consonant systems will make it possible to organize and present the pronunciation teaching material in a stress-free method that helps learners adjust their speech organs to new sounds by building on their L1 phonetic habits. In this work, we consider two examples of how such a pronunciation teaching strategy can be designed.
The rest of the paper is organized as follows. In Section 1 we review existing pronunciation training systems, consider the basic structure of their underlying intelligent tutor model, and discuss current approaches to error detection; we argue that error patterns are a feasible means of facilitating individual error identification. Section 2 specifies our methodology, and Section 3 contains a detailed comparative description of AE and MS consonants at the level of phonemes and allophones. In Section 4 we propose error patterns, and in Section 5 we consider their usage in the pronunciation acquisition process, giving two examples of teaching AE consonants based on the comparative phonetic description of Section 3. At the end of the article, we outline conclusions and future work.
1. Computer assisted pronunciation training and error detection
Today, Computer Assisted Language Learning (CALL) in general and Computer Assisted Pronunciation Training (CAPT) in particular are recognized as beneficial tools for both L2 teachers and students (Pokrivčáková, 2015). Accessibility in practically all everyday situations, flexibility, adaptability, and personalization make CALL an excellent instrument for any kind of learning: group and individual, formal and informal, stationary and mobile, in and outside the classroom (Khan, 2005; Levy & Stockwell, 2006; Burbules, 2012; Liakin, 2013). A variety of commercial CAPT software can be found online: NativeAccent™ by Carnegie Mellon University's Language Technologies Institute, www.carnegiespeech.com; Tell Me More® Premium by Auralog, www.tellmemore.com; EyeSpeak by Visual Pronunciation Software Ltd., www.eyespeakenglish.com; Pronunciation Software by Executive Language Training, www.eltlearn.com; Accent Improvement Software, www.englishtalkshop.com; Voice and Accent by Let's Talk Institute Pvt Ltd., www.letstalkpodcast.com; Master the American Accent by Language Success Press, www.loseaccent.com. Another example of a CAPT system is the application designed by the University of Iowa Research Foundation, available at http://soundsofspeech.uiowa.edu/; see Figure 1.
Notwithstanding this impressive technological advance, intelligent tutor models still require further improvement (Strik, Truong, de Wet & Cucchiarini, 2009; Hismanoglu & Hismanoglu, 2011). The capacity to detect individual errors in the learner's speech and provide relevant feedback (the activities performed at step 3, adjustment and correction, of the teaching/learning process) remains an open research issue in CALL, due to the high complexity of this computational task, which requires automatic speech recognition (ASR) at a very fine-grained level (Yu & Deng, 2012). In this paper, we focus on this challenge and address it by performing a comparative phonetic analysis of the AE and MS consonant systems. We believe that the similarities and differences found between AE and MS consonant phonemes and allophones as a result of our analysis can be applied to facilitate the individual error detection process by predicting possible mispronunciations. Our results can also be used in teaching AE consonants to MS speakers by developing strategies which anticipate and prevent possible errors. In what follows, we discuss the basic elements of an intelligent tutor model (Section 1.1) and then review some existing individual error detection methods (Section 1.2).
1.1. The basic structure of an intelligent tutor model
The basic elements of an intelligent tutor model are the tutor, learner, domain, speech processing, and error detection components (Swartz & Yazdani, 2012). Together, these components carry out the L2 teaching-learning process.
The tutor simulates the activities of an English teacher; its functions are as follows:
determine the level of the user (a Mexican Spanish-speaking learner of English pronunciation in our work);
choose a particular training unit according to the student’s prior history;
present the sound or group of sounds corresponding to the chosen training unit and explain its articulation using comparison and analogy with similar sounds in Mexican Spanish;
perform the training stage supplying the learner with training exercises, determining his/her errors by means of speech processing and error detection, generating necessary feedback, and selecting appropriate corrective drills;
evaluate the learner’s performance;
store the student’s scores and error history.
The learner component models the human learner of English; it contains the student database, which holds the following information on his/her prior history:
training units studied;
scores obtained;
errors detected during the stage of articulation training and the auditory comprehension stage.
The domain contains the knowledge base consisting of two main parts:
patterns of articulation and pronunciation, as well as pronunciation and auditory perception error patterns characteristic of MS speakers, together with individual error samples;
presentation and explanations of sounds, exercises for training articulation and auditory comprehension.
The speech processing component is responsible for recognizing the learner's speech.
The error detection component processes the recognized speech of the student and identifies pronunciation errors.
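As an illustration of how these components could fit together, the following minimal sketch (in Python; all class, field, and method names are our own hypothetical choices, not part of the cited model) encodes the learner history, the domain knowledge base, and a tutor that selects units and records results:

```python
# A hypothetical sketch of the intelligent tutor model's components.
from dataclasses import dataclass, field

@dataclass
class LearnerProfile:
    """Learner component: the student database with prior history."""
    units_studied: list[str] = field(default_factory=list)
    scores: dict[str, float] = field(default_factory=dict)
    error_history: list[str] = field(default_factory=list)

@dataclass
class Domain:
    """Domain component: articulation/error patterns and exercises."""
    articulation_patterns: dict[str, str] = field(default_factory=dict)
    error_patterns: dict[str, list[str]] = field(default_factory=dict)
    exercises: dict[str, list[str]] = field(default_factory=dict)

class Tutor:
    """Tutor component: drives the teaching-learning loop."""

    def __init__(self, learner: LearnerProfile, domain: Domain) -> None:
        self.learner = learner
        self.domain = domain

    def next_unit(self) -> str:
        # Choose a training unit the learner has not studied yet.
        for unit in self.domain.exercises:
            if unit not in self.learner.units_studied:
                return unit
        return "review"

    def record_result(self, unit: str, score: float, errors: list[str]) -> None:
        # Store the score and detected errors in the learner's history.
        self.learner.units_studied.append(unit)
        self.learner.scores[unit] = score
        self.learner.error_history.extend(errors)

tutor = Tutor(LearnerProfile(), Domain(exercises={"AE /ŋ/": ["ring", "long"]}))
print(tutor.next_unit())  # -> AE /ŋ/
```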
1.2. Individual error detection
In comparison with the evaluation of the learner's overall pronunciation (the interested reader can consult (Eskenazi, 2009) for a detailed explanation of this pronunciation correctness measure), individual error detection is a much more difficult issue, due to the high complexity of the automatic speech recognition task in general and the unresolved problems of individual sound recognition in particular; it thus remains an open question and an area of ongoing research. Attempting to develop better methods for individual error detection, researchers have suggested a number of procedures, the most representative of which are briefly reviewed in this section.
Weigelt, Sadoff, and Miller (1990) used decision trees to discriminate between voiceless fricatives and voiceless plosives using three measures of the waveform. The authors did not apply their results directly to error detection, although such an application was implied. Later, this method was put into practice by Truong, Neri, Cucchiarini and Strik (2004) to identify errors in three Dutch sounds, /A/, /Y/, and /X/, often pronounced incorrectly by L2 learners of Dutch. The classifiers used acoustic-phonetic features (amplitude, rate of rise, duration) to discriminate correct realizations of these sounds from incorrect ones. Truong et al. (2004) also used classifiers based on Linear Discriminant Analysis (LDA), obtaining positive results. Strik et al. (2009) performed further experiments with the method of Weigelt et al. (1990) and compared it with three other methods, namely Goodness of Pronunciation, Linear Discriminant Analysis with acoustic-phonetic features, and Linear Discriminant Analysis with mel-frequency cepstral coefficients. The analysis was done for the same three Dutch sounds as in (Truong et al., 2004).
The error detection task has also been studied for languages other than Dutch. Zhao, Hoshino, Suzuki, Minematsu and Hirose (2012) used Support Vector Machines with structural features to identify Chinese pronunciation errors of Japanese learners. A decision tree algorithm was used by Ito, Lim, Suzuki and Makino (2005) to identify English pronunciation errors in the speech of Japanese native speakers. The same task was pursued for Korean learners of English by Yoon, Hasegawa-Johnson, and Sproat (2010) using a combination of confidence scoring at the phone level and landmark-based Support Vector Machines. Menzel, Herron, Bonaventura and Morton (2000) used the confidence scores provided by an HMM-based speech recognizer to localize English pronunciation errors of Italian and German speakers.
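To make the setup shared by these classifier-based approaches concrete, the sketch below trains a small decision tree on acoustic-phonetic features (amplitude, rate of rise, duration, as mentioned above) to label a realization of a target sound as correct or mispronounced. The toy feature values and labels are our own illustrative assumptions, not the data or exact features of any cited study:

```python
# Illustrative sketch: a decision tree over acoustic-phonetic features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row: [amplitude, rate_of_rise, duration_ms]; label 1 = correct.
# Values are invented placeholders for demonstration only.
X_train = np.array([
    [0.82, 14.0,  95.0],
    [0.78, 12.5,  90.0],
    [0.40,  3.0, 160.0],
    [0.35,  2.5, 170.0],
])
y_train = np.array([1, 1, 0, 0])

clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Classify a new realization of the target sound.
sample = np.array([[0.75, 11.0, 100.0]])
print("correct" if clf.predict(sample)[0] == 1 else "mispronounced")
```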
However, compared to human judgment, automatic detection of erroneous sounds is far from satisfactory (Strik et al., 2009). We believe that the error detection rate can be improved by using error patterns as guidelines for predicting errors in the learner's speech.
2. Methodology
We based our comparative analysis of the consonants of American English (AE) and Mexican Spanish (MS), and the identification of their similarities and differences, on a detailed study of the literature on English and Spanish phonology and phonetics published to date. We chose publications which provide a fine-grained description of the respective sound systems, specifying the features of phonemes and their most frequently encountered allophones: Whitley (1986), Avery and Ehrlich (1992), Edwards (1997), Quilis (1997), Moreno de Alba (2001), Pineda, Castellanos, Cuétara, Galescu, Juárez, Llisterri, Pérez and Villaseñor (2010).
We paid special attention to the existing literature on teaching English pronunciation to Spanish speakers. Unfortunately, such resources are scarce. The most complete courses are 'English Phonetics and Phonology for Spanish Speakers' by Mott (2005) and 'A Course in English Phonetics for Spanish Speakers' by Finch and Ortiz Lira (1982), but they teach British English to Castilian Spanish speakers. Books like 'Teaching English Sounds to Spanish Speakers' by Schneider (1971), 'English Pronunciation for Spanish Speakers: Vowels' by Dale (1985), and 'English Pronunciation for Spanish Speakers: Consonants' by Dale and Poms (1986) teach American English but are limited to certain aspects of pronunciation and do not consider the peculiarities of Mexican Spanish.
Having studied the descriptions of English and Spanish consonants in the literature mentioned above, we made a theoretical comparison and organized our observations in a way that makes it easy to see the similarities and differences of the two consonant systems. The results of our work are presented in the next section.
3. Comparative description of AE and MS consonants
Each sound is described in the following order. First, we indicate whether a given sound is American English (AE) or Mexican Spanish (MS). Then the phonetic descriptors, or features, are listed. The phoneme sign is given between forward slashes, followed by an example word. After that, the basic allophones of the sound are given: the additional phonetic feature(s) distinguishing the allophone are specified, the allophone symbol is given in square brackets followed by an example (a word or word combination) in which this allophone is used; last, we explain in what contexts and under what conditions the allophone is produced. Additionally, every example word is transcribed; its narrow transcription is given in square brackets. Throughout the text we use IPA symbols (https://www.internationalphoneticassociation.org/content/ipa-chart).
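In a CAPT domain knowledge base, each of the descriptions that follow could be stored as a structured record. The sketch below (our own hypothetical encoding, not a format prescribed in this paper) mirrors the ordering just described, using the first entry of Section 3.1 as sample data:

```python
# A hypothetical record format for the phoneme/allophone descriptions.
from dataclasses import dataclass, field

@dataclass
class Allophone:
    feature: str        # extra feature distinguishing the allophone
    symbol: str         # allophone symbol, e.g. "pʰ"
    example: str        # example word or word combination
    transcription: str  # narrow transcription of the example
    context: str        # where this allophone is produced

@dataclass
class Phoneme:
    language: str       # "AE" or "MS"
    features: str       # phonetic descriptors
    symbol: str         # phoneme sign, e.g. "p"
    example: str        # example word with transcription
    allophones: list[Allophone] = field(default_factory=list)

# First entry of Section 3.1 encoded as data:
p_ae = Phoneme(
    language="AE",
    features="voiceless bilabial stop",
    symbol="p",
    example="pet [pet]",
    allophones=[
        Allophone("aspirated release", "pʰ", "poke", "[pʰoʊk]",
                  "word-initial and stressed positions"),
    ],
)
print(p_ae.allophones[0].symbol)  # -> pʰ
```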
3.1. Stop consonants
AE voiceless bilabial /p/ as in 'pet' [pet]. Allophones:
/p/ with aspirated release [pʰ] as in 'poke' [pʰoʊk], occurs in word-initial and stressed positions;
/p/ with unaspirated release [p˭] as in 'spot' [sp˭ɑt], occurs in consonant clusters, especially after /s/;
/p/ with nasal release [p̃] as in 'stop ’em' [stɑp̃m̩], occurs before a syllabic nasal;
unreleased [p̚] as in 'top' [tɑp̚], occurs word-finally and in some blend positions or clusters;
lengthened [pː] as in 'stop Pete' [ˈstɑpːit], occurs when /p/ arrests and releases adjoining syllable(s);
preglottalized [ʔp] as in 'conception' [kənˈsɛʔpʃn], occurs syllable-finally, before nasals or obstruents.
MS voiceless bilabial unaspirated /p/ as in poco [ˈpoko], occurs in all environments.
AE voiced bilabial /b/ as in 'bet' [bet]. Allophones:
/b/ with nasal release [b̃] as in 'rob him' [rɑb̃m̩], occurs before a syllabic nasal;
unreleased [b̚] as in 'rob' [rɑb̚], occurs word-finally and in some blend positions or clusters;
lengthened [bː] as in 'rob Bob' [ˈrɑbːˈbɑbː], occurs when /b/ arrests and releases adjoining syllable(s).
MS voiced bilabial /b/ as in van [ban]. Allophones:
[b] as in van [ban], occurs after a pause (phrase-initially, word-initially) or a nasal consonant;
approximant (spirantized) [β̞] as in haba [ˈaβ̞a], occurs in complementary distribution with [b].
MS voiced dental /d/ as in dar [dar]. Allophones:
[d] as in dar [dar], occurs after a pause (phrase-initially, word-initially), a nasal consonant, or /l/;
approximant (spirantized) [ð̞] as in nada [ˈnað̞a], occurs in complementary distribution with [d].
MS voiceless dental unaspirated /t/ as in tío [ˈtɪo], occurs in all environments.
AE voiceless alveolar /t/ as in 'ten' [ten]. Allophones:
/t/ with aspirated release [tʰ] as in 'tape' [tʰeɪp], occurs in word-initial and stressed positions;
/t/ with unaspirated release [t˭] as in 'stop' [st˭ɒp], occurs in consonant clusters, especially after /s/;
/t/ with nasal release [t̃] as in 'button' [bʌt̃n̩], occurs before a syllabic nasal;
unreleased [t̚] as in 'coat' [kot̚], occurs word-finally and in some blend positions or clusters;
lengthened [tː] as in 'let Tim' [ˈletːˈɪm], occurs when /t/ arrests and releases adjoining syllable(s);
dentalized [t̪] as in 'eighth' [eɪt̪θ], occurs before an interdental;
flapped [ɾ] as in 'letter' [ˈleɾə], occurs intervocalically when the second vowel is unstressed;
preglottalized [ʔt] as in 'atlas' [ˈæʔtləs], occurs syllable-finally, before nasals or obstruents;
glottal stop [ʔ] as in 'button' [bʌʔn], occurs before [n̩] or [l̩];
affricated (palatalized) [tʃr̥] as in 'train' [tʃr̥eɪn], occurs word-initially before /r/;
affricated (palatalized) [tʃ] as in 'eat yet' [ˈitʃət], occurs when /t/ is followed by /j/ + unstressed vowel.
AE voiced alveolar /d/ as in 'den' [den]. Allophones:
/d/ with lateral release [d‿l] as in 'cradle' [kreɪd‿l], occurs before /l/;
/d/ with nasal release [d̃] as in 'rod ’n reel' [rɑd̃n̩ril], occurs before a syllabic nasal;
unreleased [d̚] as in 'dad' [dæːd̚], occurs word-finally and in some blend positions or clusters;
lengthened [dː] as in 'sad Dave' [ˈsæːˈdːev], occurs when /d/ arrests and releases adjoining syllable(s);
dentalized [d̪] as in 'width' [wɪd̪θ], occurs before an interdental;
flapped [ɾ] as in 'ladder' [ˈlæɾə], occurs intervocalically when the second vowel is unstressed;
affricated (palatalized) [dʒr] as in 'drain' [dʒreɪn], occurs word-initially before /r/;
affricated (palatalized) [dʒ] as in 'did you' [ˈdɪdʒə], occurs when /d/ is followed by /j/ + unstressed vowel.
AE voiceless velar /k/ as in 'cap' [kæp]. Allophones:
/k/ with aspirated release [kʰ] as in 'keep' [kʰip], occurs in word-initial and stressed positions;
/k/ with unaspirated release [k˭] as in 'scope' [sk˭op], occurs in consonant clusters, especially after /s/;
/k/ with lateral release [k‿l] as in 'clock' [k‿lɑk], occurs before /l/;
/k/ with nasal release [k̃] as in 'beacon' [bik̃n̩], occurs before a syllabic nasal;
unreleased [k̚] as in 'take' [teɪk̚], occurs word-finally and in some blend positions or clusters;
lengthened [kː] as in 'take Kim' [teɪkːɪm], occurs when /k/ arrests and releases adjoining syllable(s);
preglottalized [ʔk] as in 'technical' [ˈtɛʔknɪk‿l], occurs syllable-finally, before nasals or obstruents;
glottal stop [ʔ] as in 'bacon' [beɪʔn̩], occurs before [n̩] or [l̩].
MS voiceless velar unaspirated /k/ as in cama [ˈkama]. Allophones:
[k] as in casa [ˈkasa], occurs before non-front vowels and in consonant clusters;
palatalized [kʲ] as in queso [ˈkʲeso], occurs in complementary distribution with [k].
AE voiced velar /ɡ/ as in 'gap' [ɡæp]. Allophones:
/ɡ/ with lateral release [ɡ‿l] as in 'glee' [ɡ‿li], occurs before /l/;
/ɡ/ with nasal release [ɡ̃] as in 'pig and goat' [ˈpɪɡ̃n̩ˈɡot], occurs before a syllabic nasal;
unreleased [ɡ̚] as in 'flag' [fl̥æɡ̚], occurs word-finally and in some blend positions or clusters;
lengthened [ɡː] as in 'big grapes' [ˈbɪˈɡːreɪps], occurs when /ɡ/ arrests and releases adjoining syllable(s).
MS voiced velar /ɡ/ as in gato [ˈɡato]. Allophones:
[ɡ] as in gasto [ˈɡasto], occurs after a pause (phrase-initially, word-initially) or a nasal consonant;
approximant (spirantized) [ɣ̞] as in el gasto [elˈɣ̞asto], occurs in complementary distribution with [ɡ].
3.2. Fricative consonants
AE voiceless labiodental /f/ as in 'fan' [fæn]. Allophones:
interdental [θ] as in 'trough' [trɑθ], occurs in certain words;
bilabial [ɸ] as in 'comfort' [ˈkʌmɸət], occurs after a labial.
MS voiceless labiodental /f/ as in foco [ˈfoko], occurs in all environments.
AE voiced labiodental /v/ as in 'van' [væn]. Allophone:
devoiced [v̥] as in 'have to' [ˈhæv̥tə], occurs word-finally, before or after a voiceless consonant.
MS voiceless dental /s̪/ as in Asia [ˈas̪ja], occurs in all environments.
AE voiceless interdental /θ/ as in 'thigh' [θaɪ]. Allophone:
voiced [ð] as in 'with many' [wɪðˈmenɪ], occurs in coarticulation with a voiced consonant.
AE voiced interdental /ð/ as in 'thy' [ðaɪ]. Allophone:
devoiced [ð̥] as in 'this is not theirs' [ð̥ɪsɪz ˈnɒʔˈð̥ɛˑəz], occurs before and after voiceless consonants and pauses.
AE voiceless alveolar /s/ as in 'sip' [sɪp].
MS voiceless dorsoalveolar /s/ as in sol [sol]. Allophones:
palatalized [ʒ] as in pues ya [puˈeʒa], occurs before a palatal consonant in rapid speech;
voiced [z] as in mismo [ˈmizmo], occurs intervocalically or between a vowel and a voiced consonant.
AE voiced alveolar /z/ as in 'zip' [zɪp]. Allophones:
devoiced [z̥] as in 'keys' [kiz̥], occurs word-finally, before or after voiceless consonants;
palatalized [ʒ] as in 'as you' [æˈʒju], occurs before /j/;
stopping [d] as in 'business' [ˈbɪdnɪs], occurs in selected words.
AE voiceless palatal /ʃ/ as in 'mesher' [ˈmeʃə], occurs in all positions.
MS voiceless palatal /ʃ/ as in Xola [ˈʃola].
AE voiced palatal /ʒ/ as in 'measure' [ˈmeʒə].
MS voiced dorsal palatal /ʝ/ as in yo [ʝo], occurs at the beginning of a syllable.
MS voiceless velar /x/ as in paja [ˈpaxa].
AE voiceless glottal /h/ as in 'hat' [hæt]. Allophones:
voiced [ɦ] as in 'ahead' [əˈɦed], occurs intervocalically;
palatalized [ç] as in 'hue' [çju], occurs when produced tensely;
/h/ with glottal release [ʔ] as in 'hello' [ʔeˈləʊ], occurs word-initially in some words;
omitted [ø] as in 'he has his' [hi hæzɪz], occurs when unstressed.
3.3. Affricate consonants
AE voiceless alveo-palatal /tʃ/ as in 'chin' [tʃɪn].
AE voiced alveo-palatal /dʒ/ as in 'gin' [dʒɪn].
MS voiceless palatal /t͡ʃ/ as in hacha [ˈat͡ʃa].
3.4. Approximant consonants
AE voiced labiovelar glide /w/ as in 'wed' [wed]. Allophones:
aspirated [hw] as in 'where' [hweə], occurs in wh-words;
devoiced [w̥] as in 'twenty' [ˈtw̥entɪ], occurs in voiceless clusters.
MS voiced alveolar trill /r/ as in perro [ˈpero]. Allophones:
devoiced hushing sibilant [r̥ʃ] as in ver [ber̥ʃ], occurs word-finally, mostly in female speech;
flap [ɾ] as in pero [ˈpeɾo], occurs between vowels.
AE voiced alveopalatal liquid /r/ as in 'red' [red]. Allophones:
devoiced [r̥] as in 'treat' [tr̥it], occurs in voiceless clusters;
flap [ɾ] as in 'very' [ˈveɾɪ], occurs between vowels;
retroflexed [ɻ] as in 'right' [ɻaɪt], occurs in selected words;
back [r̙] as in 'grey' [ɡr̙eɪ], occurs before or after /ɡ/, /k/.
AE voiced palatal glide /j/ as in 'yet' [jet]. Allophones:
omitted [ø] as in 'duty' [ˈdutɪ], occurs after a consonant other than a stop;
devoiced [j̥] as in 'pure' [pʰj̥uə], occurs after a voiceless stop consonant.
AE voiced alveolar lateral liquid /l/ as in 'led' [led]. Allophones:
light [l] as in 'lease' [lis], occurs before a vowel;
dark, velarized [ɫ] as in 'call' [kɔɫ], occurs after a vowel;
syllabic, also dark [l̩] as in 'bottle' [bɑʔl̩], occurs in clusters;
devoiced [l̥] as in 'play' [pl̥eɪ], occurs in voiceless clusters;
dentalized [ɫ̪] as in 'health' [hɛɫ̪θ], occurs before /θ/, /ð/.
3.5. Nasal consonants
AE voiced bilabial /m/ as in 'met' [met]. Allophones:
syllabic [m̩] as in 'something' [ˈsʌm̩θɪŋ], occurs in clusters;
lengthened [mː] as in 'some more' [sʌˈmːɔr], occurs when /m/ arrests and releases adjoining syllable(s);
labiodentalized [ɱ] as in 'comfort' [ˈkʌɱfət], occurs before /f/ or /v/.
MS voiced bilabial /m/ as in más [mas].
MS voiced dental /n̪/ as in antes [ˈan̪tes].
AE voiced alveolar /n/ as in 'net' [net]. Allophones:
syllabic [n̩] as in 'button' [bʌʔn̩], occurs in clusters;
lengthened [nː] as in 'ten names' [tenːeɪmz], occurs when /n/ arrests and releases adjoining syllable(s);
labiodentalized [ɱ] as in 'invite' [ɪɱˈvaɪt], occurs before /f/ or /v/;
dentalized [n̪] as in 'on Thursday' [ən̪ˈθɝzde], occurs before /θ/, /ð/;
velarized [ŋ] as in 'income' [ˈɪŋkəm], occurs before /k/ or /ɡ/.
MS voiced alveolar /n/ as in nene [ˈnene]. Allophones:
dentalized [n̪] as in cuanto [ˈkwan̪to], occurs before /t/ or /d/;
velarized [ŋ] as in banco [ˈbaŋko], occurs before a velar consonant.
MS voiced palatal /ɲ/ as in año [ˈaɲo].
AE voiced velar /ŋ/ as in 'lung' [lʌŋ].
4. Error patterns
In this section, we propose some basic hypothetical error patterns on the phoneme level. They are derived theoretically from the results of comparing AE and MS consonant sound systems given in Section 3. Certainly, such a theoretical approach is not sufficient to identify all possible errors of an MS learner of English. Practical research is necessary to confirm, clarify, adjust, or correct the theoretically predicted errors listed in this section. Also, more error patterns may be discovered in an empirical study of English speech produced by MS learners. We plan to do this research as future work.
Basically, all phoneme errors can be classified into three types, which we present in the following three subsections: (1) substitution of an AE phoneme by an MS phoneme, (2) insertion of an MS phoneme in an AE word, and (3) deletion of an AE phoneme. There are two main reasons why pronunciation errors are made. The first reason is phonetic: a given AE sound does not exist in MS, or if it exists, it differs in some way. The second reason is orthographic: the MS reading rules are applied to AE words. For example, 'haste' may be read as [eɪst] instead of [heɪst] because the letter h is never pronounced in Spanish. However, knowing that the English h must be pronounced, an MS learner may read it as the voiceless velar /x/ instead of the AE voiceless glottal /h/, since /x/ is the MS consonant most similar to AE /h/.
Section 4.1 presents substitution error patterns. We add the comment "due to orthography" if an error is made for this reason; if the reason is phonetic, we offer no comment. Section 4.2 lists insertion errors; they are caused by the influence of MS orthographic patterns and reading rules. Section 4.3 describes deletion errors.
4.1. Substitution
Table 1. Substitution errors.
AE consonant | Substituted by MS consonant
---|---
Stop voiceless consonants with aspirated release [pʰ], [tʰ], [kʰ] as in 'pound', 'pitch', 'pancake', 'teeth', 'touch', 'tin', 'cake', 'cast', 'coke' | Unaspirated release [p], [t], [k]
Stop voiced bilabial /b/ as in 'bet' [bet] used in intervocalic positions as in 'liberal', 'debate', 'forbade', 'possibility', 'diabolical' | Approximant (spirantized) [β̞] as in haba [ˈaβ̞a]
Stop voiced alveolar /d/ as in 'den' [den] used in intervocalic positions as in 'individual', 'prejudice', 'prudence', 'intruder', 'tedious' | Approximant (spirantized) [ð̞] as in nada [ˈnað̞a]
Stop voiced velar /ɡ/ as in 'gap' [ɡæp] used in non-initial positions as in 'regain', 'extravagant', 'plaguing', 'regard', 'agony' | Approximant (spirantized) [ɣ̞] as in el gasto [elˈɣ̞asto]
Fricative voiceless interdental /θ/ as in 'thigh' [θaɪ] | Stop voiceless dental unaspirated /t/ as in tío [ˈtɪo]
Fricative voiced interdental /ð/ as in 'thy' [ðaɪ] | Stop voiced dental /d/ as in dar [dar]
Fricative voiceless glottal /h/ as in 'hat' [hæt] | Fricative voiceless velar /x/ as in paja [ˈpaxa]
Fricative voiced labiodental /v/ as in 'van' [væn]: due to orthography | Stop voiced bilabial /b/ as in van [ban]
Fricative voiced alveolar /z/ as in 'zip' [zɪp] | Fricative voiceless dorsoalveolar /s/ as in sol [sol]
Approximant voiced alveopalatal liquid /r/ as in 'red' [red] | Voiced alveolar trill /r/ as in perro [ˈpero]
Nasal voiced velar /ŋ/ as in 'lung' [lʌŋ] | Nasal voiced alveolar /n/ as in nene [ˈnene]
4.2. Insertion
Consonant insertion is a rare phenomenon; insertion errors are more typical of vowels. However, consonants may be inserted primarily for orthographic reasons; one example is the so-called silent consonants of AE: 'b' in 'comb', 'numb', 'debt'; 'c' in 'muscle', 'scissors'; 'd' in 'Wednesday', 'sandwich', 'handsome'; 'g' in 'sign', 'gnaw', 'high', 'reign'; 'k' in 'knock', 'know', 'knife'; 'l' in 'salmon', 'calf', 'talk'; 'm' in 'mnemonic'; 'n' in 'autumn', 'column', 'solemn'; 'p' in 'pneumonia', 'psychology', 'receipt'; 's' in 'island'; 'w' in 'answer', 'sword', 'two', etc. Since these letters are read in MS, English L2 learners tend to insert the corresponding consonants.
4.3. Deletion
Phoneme deletion is typical of consonant sounds, especially in word-final position, since deletion in this position is characteristic of MS. For instance, /s/ is deleted in final position in más [mas] in the combination más rápido [ˈma ˈrapido]. Deletion may also occur in other environments; an example is the deletion of initial /h/ in 'haste' discussed at the beginning of this section.
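For use by the error detection component, the three error types of this section could be encoded roughly as follows. This is a sketch with hypothetical names, populated with patterns taken from Table 1 and Sections 4.2 and 4.3:

```python
# A hypothetical encoding of substitution/insertion/deletion patterns.
from dataclasses import dataclass

@dataclass
class ErrorPattern:
    kind: str       # "substitution" | "insertion" | "deletion"
    ae_target: str  # AE phoneme involved; "" for pure insertions
    ms_output: str  # MS sound actually produced; "" for deletions
    reason: str     # "phonetic" or "orthographic"

PATTERNS = [
    # Table 1: AE /h/ replaced by MS /x/ (phonetic reason).
    ErrorPattern("substitution", "h", "x", "phonetic"),
    # Table 1: AE /v/ replaced by MS /b/ (orthographic reason).
    ErrorPattern("substitution", "v", "b", "orthographic"),
    # Section 4.2: silent 'b' of 'comb' pronounced, i.e. /b/ inserted.
    ErrorPattern("insertion", "", "b", "orthographic"),
    # Section 4.3: word-final /s/ deleted, as in más rápido.
    ErrorPattern("deletion", "s", "", "phonetic"),
]

def patterns_for(ae_phoneme: str) -> list[ErrorPattern]:
    """All stored patterns involving a given AE phoneme."""
    return [p for p in PATTERNS if p.ae_target == ae_phoneme]

print(patterns_for("h"))  # -> the /h/ -> /x/ substitution pattern
```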
5. Error detection using patterns
Error detection and correction are very important in language learning. In the computer assisted pronunciation training model described in Section 1, the learner's errors are to be detected automatically, followed by the generation of relevant explanations, teaching instructions, and corrective exercises. As we mentioned in Section 1.2, automatic error detection at the level of individual sounds is a complex task which can be enhanced by error patterns.
As an example, consider the word 'jungle' [ˈdʒʌŋɡl]. We suggest storing two types of transcription in the phonetic database: the correct transcription and transcriptions including possible erroneous sounds annotated with their probabilities; see Table 2. If the word pronounced by the learner deviates from the correct version beyond a pre-defined threshold, the error detection model takes the error pattern probabilities into account in order to identify the concrete error.
Table 2. Consonant pronunciation errors in the word 'jungle'.
Correct | Incorrect | Probability | Reason
---|---|---|---
[ˈdʒʌŋɡl] | [ˈhʌŋɡl] | 0.50 | Orthographic
 | [ˈjʌŋɡl] | 0.20 | Substitution of /dʒ/ with /j/
 | [ˈʝʌŋɡl] | 0.20 | Substitution of /dʒ/ with /ʝ/
 | [ˈdjʌŋɡl] | 0.10 | Substitution of /dʒ/ with /dj/
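A minimal sketch of the detection step just described, assuming a generic string-similarity measure as the deviation score (the similarity function and threshold value are our own illustrative choices; the variants and probabilities come from Table 2):

```python
# A toy sketch of pattern-based error identification.
from difflib import SequenceMatcher

CORRECT = "ˈdʒʌŋɡl"
ERROR_VARIANTS = {  # variants and probabilities from Table 2
    "ˈhʌŋɡl": (0.50, "orthographic"),
    "ˈjʌŋɡl": (0.20, "substitution of /dʒ/ with /j/"),
    "ˈʝʌŋɡl": (0.20, "substitution of /dʒ/ with /ʝ/"),
    "ˈdjʌŋɡl": (0.10, "substitution of /dʒ/ with /dj/"),
}
THRESHOLD = 0.9  # assumed similarity threshold for flagging a deviation

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def diagnose(recognized: str) -> str:
    """Compare the recognized pronunciation with the correct form and,
    on deviation, pick the most plausible stored error variant."""
    if similarity(recognized, CORRECT) >= THRESHOLD:
        return "no error detected"
    # Rank variants by similarity to the input, breaking ties by the
    # prior probability of the corresponding error pattern.
    variant, (prob, reason) = max(
        ERROR_VARIANTS.items(),
        key=lambda kv: (similarity(recognized, kv[0]), kv[1][0]),
    )
    return f"likely error: {reason} (pattern probability {prob})"

print(diagnose("ˈhʌŋɡl"))  # -> likely error: orthographic (pattern probability 0.5)
```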
6. Examples of Error-Preventive AE Sound Training
In this section we give two examples of teaching AE sounds to MS speakers, taking into account the information presented in Sections 3 and 4. These examples show how the results of our comparative analysis can be applied to developing error-preventing methods in pronunciation training. Example 1 involves an AE sound which does not exist in MS as a phoneme but appears as an allophone of another phoneme. Example 2 involves an AE phoneme absent in MS at the level of both phoneme and allophone. In both examples, teaching proceeds in the following stages: (1) presentation of the AE phoneme and explanation of its articulation in comparison with similar MS sound(s); (2) training of the AE phoneme, first using MS words with similar sound(s) and then AE words of increasing complexity; (3) training of auditory recognition of the AE phoneme, first using minimal pairs, then words of increasing complexity, word combinations, and phrases depending on the student's level (elementary, intermediate, advanced). In both examples we refer to these three stages.
The three stages of AE phoneme training can be incorporated into a CAPT system whose main modules are shown in Figure 2. In Section 1 we mentioned the University of Iowa phonetic application (see Figure 1), in which the learner can find descriptions and visual representations of English and Spanish phonemes; however, these diagrams are located in two separate modules of that system (English and Spanish) which do not interact. We believe that an improved model should be built on a contrastive interactive principle, which would be more effective for training new phonemes and their allophones. We illustrate this idea with the following two examples, accompanying them with diagrams from the University of Iowa phonetic application.
Example 1
The phoneme /ŋ/ as in 'lung' [lʌŋ] does not exist in the MS phonemic system. Nevertheless, from Section 3.5 it is clear that [ŋ] is the allophone of /n/ generated when /n/ combines with the velar consonant phonemes /k/ (banco [ˈbaŋko]), /ɡ/ (pongo [ˈpoŋɡo]), and /x/ (ángel [ˈaŋxel]); therefore, this allophone can be used for explaining /ŋ/ articulation at stage 1 and for initial /ŋ/ training at stage 2. The explanation may begin with the comment that /ŋ/ is similar to the sound produced in MS words like banco, pongo, ángel. These words are simple and in common usage, so they are suitable for the explanation, though ángel is not relevant for the training stage because AE /ŋ/ does not combine with /h/, the phoneme closest to MS /x/. The learner is asked to prolong the sound corresponding to the letter n in pongo (pon-n-n-ngo), thus becoming conscious of its articulation and acoustic features. Stage 1 may be accompanied by a picture (or animation) of the speech organs during /ŋ/ articulation and a recording of /ŋ/ sounding separately as well as in MS words which appear on the screen.
At stage 2, the learner is first exposed to simple AE words where the phoneme /ŋ/ appears in surroundings similar to those of the MS words practiced before: /ŋ/+/k/ 'drink', 'uncle', 'increase'; /ŋ/+/ɡ/ 'singer', 'language', 'younger'. Next, /ŋ/ is introduced in combinations typical only of AE: /ŋ/+/z/ 'brings', 'things', 'songs'; word-final /ŋ/ 'ring', 'hang', 'long', 'doing', 'nothing'. Stage 3 is devoted to auditory comprehension of AE words containing /ŋ/. Initially, the words practiced at stage 2 are presented to the learner, then other words of increasing complexity including minimal pairs (e.g. 'sin' - 'sing', 'sun' - 'sung', 'fan' - 'fang'), and afterwards short and longer phrases. At each stage, pronunciation errors are identified, explained to the learner by contrasting /ŋ/ in MS and AE words, and corrected with additional exercises. The error detection process is facilitated by the error patterns predicted from the results presented in Section 3. Figure 3(a) illustrates the similarities and differences of /ŋ/ and /n/.
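The staged material of Example 1 could be packaged as a training unit for the CAPT system roughly as follows; the structure and field names are our own illustration, while the word lists are those given above:

```python
# A hypothetical training-unit record for Example 1 (/ŋ/).
NG_UNIT = {
    "phoneme": "ŋ",
    # Stage 1: MS anchor words (ángel excluded from training, see above).
    "stage1_ms_anchors": ["banco", "pongo"],
    # Stage 2: AE words in familiar, then AE-only, surroundings.
    "stage2_familiar_contexts": {
        "ŋ+k": ["drink", "uncle", "increase"],
        "ŋ+ɡ": ["singer", "language", "younger"],
    },
    "stage2_new_contexts": {
        "ŋ+z": ["brings", "things", "songs"],
        "word-final ŋ": ["ring", "hang", "long", "doing", "nothing"],
    },
    # Stage 3: auditory comprehension material.
    "stage3_minimal_pairs": [("sin", "sing"), ("sun", "sung"), ("fan", "fang")],
}
```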
Example 2
AE voiced alveo-palatal /dʒ/ as in 'gin' [dʒɪn] does not exist in MS as a phoneme, nor is it observed at the allophone level. However, there are MS sounds similar to the components of /dʒ/: dental /d/ as in dar [dar] and dorsal palatal /ʝ/ as in yo [ʝo]. So stage 1 may begin with an explanation of this fact, as well as of the differences between MS dental /d/ and AE alveolar /d/, and between MS dorsal palatal /ʝ/ and AE palatal /ʒ/ as in 'measure' [ˈmeʒə]. Then the learner practices both /d/ and /ʒ/ at stage 2. When the student is able to produce both AE sounds in a reasonably correct manner, s/he should be told that the two sounds must be pronounced in a connected and continuous way: the learner should only begin articulating /d/ and, instead of pronouncing it completely, move the tongue down to make the /ʒ/ sound. This part of the training in fact belongs to stage 1, so after practicing the components of /dʒ/, the student goes back to stage 1 for more explanation and then proceeds with training /dʒ/ in various positions within words and then phrases. Figure 3(b) illustrates the similarities and differences of the respective AE and MS sounds.
CONCLUSIONS
In this paper, we presented the results of a detailed comparative analysis of American English (AE) and Mexican Spanish (MS) consonants at the level of both phonemes and allophones. This is a significant contribution to the research field, as such an analysis had not been done in previous work. The results of our analysis are detailed contrastive descriptions of all AE and MS consonant phonemes and their most frequently observed allophones, presented in such a way that it is easy to notice and explore the similarities and differences of the two consonant systems.
As a possible practical application of our results, we considered a Computer Assisted Pronunciation Training model for teaching AE pronunciation to MS speakers. In this model, the descriptions of consonants given in this article can be used for more effective automatic individual error detection, which in turn allows for generating relevant feedback and presenting it to the learner. Error identification and adequate feedback generation are open research issues, since existing applications still perform these tasks with low precision compared to human judgment. We showed how the differences and similarities between the consonant systems of AE and MS presented in this work can be used to design error patterns for mispronunciation prediction, thus improving the performance of intelligent tutor applications.
Another use of our results is the development of teaching strategies which anticipate and prevent possible AE pronunciation errors in the speech of MS students. We presented two examples of how the teaching of articulation and auditory comprehension can be enhanced when typical error patterns are known in advance.
In future work, we plan to compare the results of our theoretical phonetic analysis with the errors observed empirically in learners' speech production, in order to modify the proposed error patterns if necessary and to define a comprehensive list of error patterns. Such a list will be a valuable resource for L2 English pronunciation training via a human instructor and/or an intelligent tutor model.