
The Time Course of Audio-Visual Phoneme Identification: a High Temporal Resolution Study



Speech unfolds in time and, as a consequence, its perception requires temporal integration. Yet, studies addressing audio-visual speech processing have often overlooked this temporal aspect. Here, we address the time course of audio-visual speech processing in a phoneme identification task using a gating paradigm. We created disyllabic Spanish word-like utterances (e.g., /pafa/, /paθa/, …) from high-speed camera recordings. The stimuli differed only in the middle consonant (/f/, /θ/, /s/, /r/, /g/), which varied in visual and auditory saliency. As in classical gating tasks, the utterances were presented in fragments of increasing length (gates), here in 10 ms steps, for identification and confidence ratings. We measured correct identification as a function of time (at each gate) for each critical consonant in auditory-only, visual-only and audio-visual conditions, and computed Identification Point and Recognition Point scores. The results revealed that audio-visual identification is a time-varying process that depends on the relative strength of each modality (i.e., saliency). In some cases, audio-visual identification followed the pattern of one dominant modality (either A or V), when that modality was very salient. In other cases, both modalities contributed to identification, resulting in an audio-visual advantage or interference relative to the unimodal conditions. Both unimodal dominance and audio-visual interaction patterns may arise within the course of identification of the same utterance, at different times. The outcome of this study suggests that audio-visual speech integration models should take into account the time-varying nature of visual and auditory saliency.
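As a rough illustration of the gate-by-gate scoring described above, the sketch below computes an Identification Point from one participant's ordered responses. It assumes a Grosjean-style definition (the earliest gate from which the response equals the target and stays correct through the final gate); the function and variable names are hypothetical, not taken from the study's materials.

```python
def identification_point(responses, target, gate_ms=10):
    """Return the time (ms) of the earliest gate from which the response
    equals `target` at every remaining gate, or None if the target is
    never stably identified. `responses` holds one label per gate, in
    order of increasing fragment length."""
    ip_gate = None
    for gate, resp in enumerate(responses, start=1):
        if resp == target:
            if ip_gate is None:
                ip_gate = gate  # candidate: first gate of a correct run
        else:
            ip_gate = None      # run broken; identification not yet stable
    return ip_gate * gate_ms if ip_gate is not None else None

# Example: correct from the 4th 10-ms gate onward -> IP = 40 ms
print(identification_point(["f", "f", "f", "s", "s", "s"], "s"))  # → 40
```

A Recognition Point could be derived the same way by additionally requiring the confidence rating at each gate of the run to exceed a criterion.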

Affiliations: 1. Departament de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Barcelona, Spain; 2. Université Grenoble Alpes, GIPSA-lab (CNRS UMR 5216), Grenoble, France; 3. Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain

*To whom correspondence should be addressed. E-mail:


