51. Bharath Kumar SV, Umesh S. Nonuniform speaker normalization using affine transformation. J Acoust Soc Am 2008; 124:1727-1738. PMID: 19045663. DOI: 10.1121/1.2951597.
Abstract
In this paper, a well-motivated nonuniform speaker normalization model that affinely relates the formant frequencies of speakers enunciating the same sound is proposed. Using the proposed affine model, the corresponding universal-warping function that is required for normalization is shown to have the same parametric form as the mel scale formula. The parameters of this universal-warping function are estimated from the vowel formant data and are shown to be close to the commonly used formula for the mel scale. This shows an interesting connection between nonuniform speaker normalization and the psychoacoustics-based mel scale. In addition, the affine model fits the vowel formant data better than commonly used ad hoc normalization models. This work is motivated by a desire to improve the performance of speaker-independent speech recognition systems, where speaker normalization is conventionally done by assuming a linear-scaling relationship between spectra of speakers. The proposed affine relation is extended to describe the relationship between spectra of speakers enunciating the same sound. On a telephone-based connected digit recognition task, the proposed model provides improved recognition performance over the linear-scaling model.
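The universal-warping function discussed above shares the parametric form of the familiar mel formula, f_mel = c log10(1 + f/f0). A minimal sketch, using the standard mel constants (c = 2595, f0 = 700) rather than the parameter values the paper estimates from formant data:

```python
import math

def mel_warp(f_hz, c=2595.0, f0=700.0):
    """Mel-style warping: c * log10(1 + f/f0).

    c and f0 here are the standard mel-scale constants; the paper
    estimates analogous parameters from vowel formant data.
    """
    return c * math.log10(1.0 + f_hz / f0)

def mel_unwarp(m, c=2595.0, f0=700.0):
    """Inverse warping, back to Hz."""
    return f0 * (10.0 ** (m / c) - 1.0)
```

With these constants, 1000 Hz maps to roughly 1000 mel, the anchor point of the conventional scale.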
Affiliation(s)
- S V Bharath Kumar
- Department of Electrical and Computer Engineering, University of California-San Diego, La Jolla, California 92093-0407, USA.
52. Liu C, Eddins DA. Effects of spectral modulation filtering on vowel identification. J Acoust Soc Am 2008; 124:1704-1715. PMID: 19045661. PMCID: PMC2676619. DOI: 10.1121/1.2956468.
Abstract
The goal of this study was to measure the effects of global spectral manipulations on vowel identification by progressively high-pass filtering vowel stimuli in the spectral modulation domain. Twelve American-English vowels, naturally spoken by a female talker, were subjected to varied degrees of high-pass filtering in the spectral modulation domain, with cutoff frequencies of 0.0, 0.5, 1.0, 1.5, and 2.0 cycles/octave. Identification performance for vowels presented at 70 dB sound pressure level with and without spectral modulation filtering was measured for five normal-hearing listeners. Results indicated that vowel identification performance was progressively degraded as the spectral modulation cutoff frequency increased. Degradation of vowel identification was greater for back vowels than for front or central vowels. Detailed acoustic analyses indicated that spectral modulation filtering resulted in a more crowded vowel space (F1 × F2), reduced spectral contrast, and reduced spectral tilt relative to the original unfiltered vowels. Changes in the global spectral features produced by spectral modulation filtering were associated with substantial reduction in vowel identification. The results indicated that the spectral cues critical for vowel identification were represented by spectral modulation frequencies below 2 cycles/octave. These results are considered in terms of the interactions among spectral shape perception, spectral smearing, and speech perception.
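The filtering operation described above can be sketched as follows: a log-magnitude spectrum sampled uniformly on a log2-frequency (octave) axis is Fourier transformed, components below the cutoff (in cycles/octave) are zeroed, and the result is transformed back. This is an illustrative reconstruction of the manipulation, not the authors' implementation; retaining the DC (overall level) term is an assumption.

```python
import numpy as np

def highpass_spectral_modulation(log_spectrum, octaves_per_bin, cutoff_cyc_per_oct):
    """High-pass filter a log-magnitude spectrum in the spectral
    modulation domain (a sketch, not the published implementation).

    log_spectrum: spectrum in dB, sampled uniformly on a
    log2-frequency (octave) axis, spaced octaves_per_bin apart.
    """
    n = len(log_spectrum)
    mod = np.fft.rfft(log_spectrum)
    # modulation frequency of each bin, in cycles/octave
    mod_freqs = np.fft.rfftfreq(n, d=octaves_per_bin)
    keep_dc = mod[0]                          # overall level
    mod[mod_freqs < cutoff_cyc_per_oct] = 0.0
    mod[0] = keep_dc                          # retain mean level (assumed choice)
    return np.fft.irfft(mod, n=n)
```

With a cutoff of 0.0 cycles/octave nothing is removed, matching the study's unfiltered control condition.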
Affiliation(s)
- Chang Liu
- Department of Communication Sciences and Disorders, University of Texas at Austin, 1 University Station A1100, Austin, Texas 78712, USA.
53. Patterson RD, Johnsrude IS. Functional imaging of the auditory processing applied to speech sounds. Philos Trans R Soc Lond B Biol Sci 2008; 363:1023-1035. PMID: 17827103. PMCID: PMC2606794. DOI: 10.1098/rstb.2007.2157.
Abstract
In this paper, we describe domain-general auditory processes that we believe are prerequisite to the linguistic analysis of speech. We discuss biological evidence for these processes and how they might relate to processes that are specific to human speech and language. We begin with a brief review of (i) the anatomy of the auditory system and (ii) the essential properties of speech sounds. Section 4 describes the general auditory mechanisms that we believe are applied to all communication sounds, and how functional neuroimaging is being used to map the brain networks associated with domain-general auditory processing. Section 5 discusses recent neuroimaging studies that explore where such general processes give way to those that are specific to human speech and language.
Affiliation(s)
- Roy D Patterson
- Centre for the Neural Basis of Hearing, Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge CB2 3EG, UK.
54. Dorman MF, Gifford RH, Spahr AJ, McKarns SA. The benefits of combining acoustic and electric stimulation for the recognition of speech, voice and melodies. Audiol Neurootol 2007; 13:105-112. PMID: 18057874. PMCID: PMC3559130. DOI: 10.1159/000111782.
Abstract
Fifteen patients fit with a cochlear implant in one ear and a hearing aid in the other ear were presented with tests of speech and melody recognition and voice discrimination under conditions of electric (E) stimulation, acoustic (A) stimulation, and combined electric and acoustic stimulation (EAS). When acoustic information was added to electrically stimulated information, performance increased by 17-23 percentage points on tests of word and sentence recognition in quiet and sentence recognition in noise. On average, the EAS patients achieved higher scores on CNC words than patients fit with a unilateral cochlear implant. While the best EAS patients did not outperform the best patients fit with a unilateral cochlear implant, proportionally more EAS patients than unilateral cochlear implant patients achieved very high scores on tests of speech recognition.
Affiliation(s)
- Michael F Dorman
- Department of Speech and Hearing Science, Arizona State University, Tempe, AZ 85287-0102, USA.
55. Villacorta VM, Perkell JS, Guenther FH. Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. J Acoust Soc Am 2007; 122:2306-2319. PMID: 17902866. DOI: 10.1121/1.2773966.
Abstract
The role of auditory feedback in speech motor control was explored in three related experiments. Experiment 1 investigated auditory sensorimotor adaptation: the process by which speakers alter their speech production to compensate for perturbations of auditory feedback. When the first formant frequency (F1) was shifted in the feedback heard by subjects as they produced vowels in consonant-vowel-consonant (CVC) words, the subjects' vowels demonstrated compensatory formant shifts that were maintained when auditory feedback was subsequently masked by noise, evidence of adaptation. Experiment 2 investigated auditory discrimination of synthetic vowel stimuli differing in F1 frequency, using the same subjects. Those with more acute F1 discrimination had compensated more to F1 perturbation. Experiment 3 consisted of simulations with the Directions Into Velocities of Articulators (DIVA) model of speech motor planning, which showed that the model can account for key aspects of compensation. In the model, movement goals for vowels are regions in auditory space; perturbation of auditory feedback invokes auditory feedback control mechanisms that correct for the perturbation, which in turn causes updating of feedforward commands to incorporate these corrections. The relation between speaker acuity and amount of compensation to auditory perturbation is mediated by the size of speakers' auditory goal regions, with more acute speakers having smaller goal regions.
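The goal-region mechanism in the abstract can be illustrated with a toy loop (not the DIVA model itself): the feedforward command is corrected by a fraction of the auditory error, but no correction is issued while the heard formant lies inside the goal region, so smaller goal regions (more acute listeners) yield more compensation. All parameter values are illustrative assumptions.

```python
def simulate_adaptation(target_f1=700.0, shift=200.0, goal_radius=30.0,
                        feedback_gain=0.3, n_trials=50):
    """Toy sketch of auditory-feedback-driven adaptation.

    target_f1: auditory target (Hz); shift: feedback perturbation (Hz);
    goal_radius: half-width of the auditory goal region (Hz).
    Returns the adapted feedforward F1 command after n_trials.
    """
    feedforward = target_f1
    for _ in range(n_trials):
        heard = feedforward + shift          # perturbed auditory feedback
        error = heard - target_f1
        if abs(error) > goal_radius:         # outside the goal region
            feedforward -= feedback_gain * error
    return feedforward
```

Running this with a small versus a large goal radius reproduces the qualitative acuity effect: the "acute" speaker (small region) ends up compensating more.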
Affiliation(s)
- Virgilio M Villacorta
- Speech Communication Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Room 36-591, 50 Vassar Street, Cambridge, Massachusetts 02139, USA
56. Hoelterhoff J, Reetz H. Acoustic cues discriminating German obstruents in place and manner of articulation. J Acoust Soc Am 2007; 121:1142-1156. PMID: 17348535. DOI: 10.1121/1.2427122.
Abstract
This study focuses on the extraction of robust acoustic cues of labial and alveolar voiceless obstruents in German and on the acoustic differences in the speech signal that distinguish them in place and manner of articulation. The investigated obstruents include the affricates [pf] and [ts], the fricatives [f] and [s], and the stops [p] and [t]. The target sounds were analyzed in word-initial and word-medial positions. The speech data for the analysis were recorded in a natural environment, deliberately containing background noise, so that only robust cues would be extracted. Three methods of acoustic analysis were chosen: (1) temporal measurements to distinguish the respective obstruents in manner of articulation, (2) static spectral characteristics in terms of a logarithmic distance measure to distinguish place of articulation, and (3) amplitudinal analysis of discrete frequency bands as a dynamic approach to place distinction. The results reveal that the duration of the target phonemes distinguishes them in manner of articulation. The logarithmic distance measure, as well as relative amplitude analysis of discrete frequency bands, identifies place of articulation. The present results bear on the question of which properties are robust with respect to variation in the speech signal.
Affiliation(s)
- Julia Hoelterhoff
- Universität Konstanz, Fachbereich Sprachwissenschaft, D-186, 78457 Konstanz, Germany.
57. Eriksson JL, Villa AEP. Learning of auditory equivalence classes for vowels by rats. Behav Processes 2006; 73:348-359. PMID: 16997507. DOI: 10.1016/j.beproc.2006.08.005.
Abstract
Four male Long-Evans rats were trained to discriminate between synthetic vowel sounds using a GO/NOGO response choice task. The vowels were characterized by an increase in fundamental frequency correlated with an upward shift in formant frequencies. In an initial phase we trained the subjects to discriminate between two vowel categories using two exemplars from each category. In a subsequent phase the ability of the rats to generalize the discrimination between the two categories was tested. To test whether rats might exploit the fact that attributes of training stimuli covaried, we used non-standard stimuli with a reversed relation between fundamental frequency and formants. The overall results demonstrate that rats are able to generalize the discrimination to new instances of the same vowels. We present evidence that the performance of the subjects depended on the relation between fundamental and formant frequencies that they had previously been exposed to. Simple simulation results with artificial neural networks could reproduce most of the behavioral results and support the hypothesis that equivalence classes for vowels are associated with an experience-driven process based on general properties of peripheral auditory coding mixed with elementary learning mechanisms. These results suggest that rats use spectral and temporal cues similarly to humans despite differences in basic auditory capabilities.
Affiliation(s)
- Jan L Eriksson
- Neuroheuristic Research Group, Department of Information Sciences, INFORGE, Université de Lausanne, 1015 Lausanne, Switzerland.
58. Wassink AB. A geometric representation of spectral and temporal vowel features: quantification of vowel overlap in three linguistic varieties. J Acoust Soc Am 2006; 119:2334-2350. PMID: 16642847. DOI: 10.1121/1.2168414.
Abstract
A geometrical method for computing overlap between vowel distributions, the spectral overlap assessment metric (SOAM), is applied to an investigation of spectral (F1, F2) and temporal (duration) relations in three different types of systems: one claimed to exhibit primary quality (American English), one primary quantity (Jamaican Creole), and one about which no claims have been made (Jamaican English). Shapes, orientations, and proximities of pairs of vowel distributions involved in phonological oppositions are modeled using best-fit ellipses (in F1 × F2 space) and ellipsoids (F1 × F2 × duration). Overlap fractions computed for each pair suggest that spectral and temporal features interact differently in the three varieties and oppositions. Under a two-dimensional analysis, two of three American English oppositions show no overlap; the third shows partial overlap. All Jamaican Creole oppositions exhibit complete overlap when F1 and F2 alone are modeled, but no or partial overlap with incorporation of a factor for duration. Jamaican English three-dimensional overlap fractions resemble two-dimensional results for American English. A multidimensional analysis tool such as SOAM appears to provide a more objective basis for simultaneously investigating spectral and temporal relations within vowel systems. Normalization methods and the SOAM method are described in an extended appendix.
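A SOAM-style overlap fraction can be sketched from best-fit ellipses as follows: fit a mean and covariance to one vowel's tokens, then count the fraction of the other vowel's tokens falling inside the resulting 95% ellipse (chi-square cutoff 5.991 for two dimensions; 7.815 would serve for the three-dimensional F1 × F2 × duration case). This is an illustrative reconstruction, not Wassink's exact formula.

```python
import numpy as np

def overlap_fraction(a, b, conf_chi2=5.991):
    """Fraction of tokens of vowel `a` inside the best-fit 95%
    ellipse of vowel `b` (sketch of a SOAM-style overlap measure).

    a, b: (n, 2) arrays of (F1, F2) tokens for the two vowels.
    """
    mu = b.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(b, rowvar=False))
    d = a - mu
    # squared Mahalanobis distance of each token of `a` from `b`
    mahal2 = np.einsum('ij,jk,ik->i', d, cov_inv, d)
    return float(np.mean(mahal2 <= conf_chi2))
```

Two well-separated distributions yield a fraction near 0; identically distributed clouds yield a fraction near the 0.95 confidence level.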
59. Guenther FH, Ghosh SS, Tourville JA. Neural modeling and imaging of the cortical interactions underlying syllable production. Brain Lang 2006; 96:280-301. PMID: 16040108. PMCID: PMC1473986. DOI: 10.1016/j.bandl.2005.06.001.
Abstract
This paper describes a neural model of speech acquisition and production that accounts for a wide range of acoustic, kinematic, and neuroimaging data concerning the control of speech movements. The model is a neural network whose components correspond to regions of the cerebral cortex and cerebellum, including premotor, motor, auditory, and somatosensory cortical areas. Computer simulations of the model verify its ability to account for compensation to lip and jaw perturbations during speech. Specific anatomical locations of the model's components are estimated, and these estimates are used to simulate fMRI experiments of simple syllable production.
Affiliation(s)
- Frank H. Guenther
- Department of Cognitive and Neural Systems, Boston University, 677 Beacon Street, Boston, MA 02215, USA.
- Speech and Hearing Bioscience and Technology Program, Harvard University/Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
- Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, MA 02129, USA.
- Satrajit S. Ghosh
- Department of Cognitive and Neural Systems, Boston University, 677 Beacon Street, Boston, MA 02215, USA.
- Jason A. Tourville
- Department of Cognitive and Neural Systems, Boston University, 677 Beacon Street, Boston, MA 02215, USA.
60. Liu C, Kewley-Port D. Formant discrimination in noise for isolated vowels. J Acoust Soc Am 2004; 116:3119-3129. PMID: 15603157. DOI: 10.1121/1.1802671.
Abstract
Formant discrimination for isolated vowels presented in noise was investigated for normal-hearing listeners. Discrimination thresholds for F1 and F2, for the seven American English vowels /i, ɪ, ɛ, æ, [symbol see text], ɑ, u/, were measured under two types of noise, long-term speech-shaped noise (LTSS) and multitalker babble, and also under quiet listening conditions. Signal-to-noise ratios (SNR) varied from -4 to +4 dB in steps of 2 dB. All three factors, formant frequency, signal-to-noise ratio, and noise type, had significant effects on vowel formant discrimination. Significant interactions among the three factors showed that threshold-frequency functions depended on SNR and noise type. The thresholds at the lowest SNRs were elevated by a factor of about 3 compared to those in quiet. The masking functions (threshold vs SNR) were well described by a negative exponential over F1 and F2 for both LTSS and babble noise. Speech-shaped noise was a slightly more effective masker than multitalker babble, presumably reflecting small benefits (1.5 dB) due to the temporal variation of the babble.
Affiliation(s)
- Chang Liu
- Department of Speech and Hearing Sciences, Indiana University, Bloomington, Indiana 47405, USA.
61. Adank P, Smits R, van Hout R. A comparison of vowel normalization procedures for language variation research. J Acoust Soc Am 2004; 116:3099-3107. PMID: 15603155. DOI: 10.1121/1.1795335.
Abstract
An evaluation of vowel normalization procedures for the purpose of studying language variation is presented. The procedures were compared on how effectively they (a) preserve phonemic information, (b) preserve information about the talker's regional background (or sociolinguistic information), and (c) minimize anatomical/physiological variation in acoustic representations of vowels. Recordings were made for 80 female talkers and 80 male talkers of Dutch. These talkers were stratified according to their gender and regional background. The normalization procedures were applied to measurements of the fundamental frequency and the first three formant frequencies for a large set of vowel tokens. The normalization procedures were evaluated through statistical pattern analysis. The results show that normalization procedures that use information across multiple vowels ("vowel-extrinsic" information) to normalize a single vowel token performed better than those that include only information contained in the vowel token itself ("vowel-intrinsic" information). Furthermore, the results show that normalization procedures that operate on individual formants performed better than those that use information across multiple formants (e.g., "formant-extrinsic" F2-F1).
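One widely used vowel-extrinsic procedure of the kind discussed above is Lobanov z-score normalization, which standardizes each of a talker's formants across all of that talker's vowel tokens. The paper compares many procedures; this single example is included only to make the vowel-extrinsic idea concrete.

```python
import numpy as np

def lobanov(formants):
    """Lobanov z-score normalization (one vowel-extrinsic procedure).

    formants: (n_tokens, n_formants) array of ONE talker's formant
    measurements in Hz, pooled across all of that talker's vowels.
    Each formant is standardized across the talker's tokens, so the
    normalization uses information from multiple vowels.
    """
    f = np.asarray(formants, dtype=float)
    return (f - f.mean(axis=0)) / f.std(axis=0)
```

After normalization every formant dimension has zero mean and unit variance within a talker, which removes talker-level (largely anatomical) offsets while preserving the relative positions of vowels.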
Affiliation(s)
- Patti Adank
- Center for Language Studies, Radboud University Nijmegen, 6500 HD Nijmegen, The Netherlands.
62. Allen JS, Miller JL. Listener sensitivity to individual talker differences in voice-onset-time. J Acoust Soc Am 2004; 115:3171-3183. PMID: 15237841. DOI: 10.1121/1.1701898.
Abstract
Recent findings in the domains of word and talker recognition reveal that listeners use previous experience with an individual talker's voice to facilitate subsequent perceptual processing of that talker's speech. These findings raise the possibility that listeners are sensitive to talker-specific acoustic-phonetic properties. The present study tested this possibility directly by examining listeners' sensitivity to talker differences in the voice-onset-time (VOT) associated with a word-initial voiceless stop consonant. Listeners were trained on the speech of two talkers. Speech synthesis was used to manipulate the VOTs of these talkers so that one had short VOTs and the other had long VOTs (counterbalanced across listeners). The results of two experiments using a paired-comparison task revealed that, when presented with a short- versus long-VOT variant of a given talker's speech, listeners could select the variant consistent with their experience of that talker's speech during training. This was true when listeners were tested on the same word heard during training and when they were tested on a different word spoken by the same talker, indicating that listeners generalized talker-specific VOT information to a novel word. Such sensitivity to talker-specific acoustic-phonetic properties may subserve at least in part listeners' capacity to benefit from talker-specific experience.
Affiliation(s)
- J Sean Allen
- Department of Psychology, Northeastern University, Boston, Massachusetts 02115, USA
63. Apostol L, Perrier P, Bailly G. A model of acoustic interspeaker variability based on the concept of formant-cavity affiliation. J Acoust Soc Am 2004; 115:337-351. PMID: 14759026. DOI: 10.1121/1.1631946.
Abstract
A method is proposed to model the interspeaker variability of formant patterns for oral vowels. It is assumed that this variability originates in the differences existing among speakers in the respective lengths of their front and back vocal-tract cavities. In order to characterize, from the spectral description of the acoustic speech signal, these vocal-tract differences between speakers, each formant is interpreted, according to the concept of formant-cavity affiliation, as a resonance of a specific vocal-tract cavity. Its frequency can thus be directly related to the corresponding cavity length, and a transformation model can be proposed from a speaker A to a speaker B on the basis of the frequency ratios of the formants corresponding to the same resonances. To minimize the number of sounds that must be recorded for each speaker to carry out this speaker transformation, the frequency ratios are computed exactly only for the three extreme cardinal vowels [i, a, u] and are approximated for the remaining vowels through an interpolation function. The method is evaluated through its capacity to transform the (F1,F2) formant patterns of eight oral vowels pronounced by five male speakers into the (F1,F2) patterns of the corresponding vowels generated by an articulatory model of the vocal tract. The resulting formant patterns are compared to those provided by normalization techniques published in the literature. The proposed method is found to be efficient, but a number of limitations are also observed and discussed. These limitations can be associated with the formant-cavity affiliation model itself or with a possible influence of speaker-specific vocal-tract geometry in the cross-sectional direction, which the model might not have taken into account.
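The ratio-plus-interpolation step can be sketched as follows: exact frequency ratios are computed at the anchor vowels ([i, a, u] in the paper), and ratios for other vowels are interpolated as a function of the source frequency. The linear interpolation used here is an assumption for illustration; the paper's interpolation function is tied to formant-cavity affiliation.

```python
import numpy as np

def transform_formants(f_src, anchors_src, anchors_tgt):
    """Simplified sketch of the ratio-based speaker transform.

    f_src: formant frequencies (Hz) of speaker A for one formant.
    anchors_src, anchors_tgt: that formant measured at the anchor
    vowels for speakers A and B respectively (same vowel order).
    """
    order = np.argsort(anchors_src)
    xs = np.asarray(anchors_src, dtype=float)[order]
    # exact target/source ratio at each anchor vowel
    ratios = (np.asarray(anchors_tgt, dtype=float) /
              np.asarray(anchors_src, dtype=float))[order]
    # interpolate the ratio at the source frequency, then scale
    return np.asarray(f_src, dtype=float) * np.interp(f_src, xs, ratios)
```

When speaker B's anchors are a uniform scaling of speaker A's, the transform reduces to that uniform scaling, as expected.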
Affiliation(s)
- Lian Apostol
- Institut de la Communication Parlée, UMR CNRS 5009, INPG, Grenoble, France
64. Hillenbrand JM, Houde RA. A narrow band pattern-matching model of vowel perception. J Acoust Soc Am 2003; 113:1044-1055. PMID: 12597197. DOI: 10.1121/1.1513647.
Abstract
The purpose of this paper is to propose and evaluate a new model of vowel perception which assumes that vowel identity is recognized by a template-matching process involving the comparison of narrow band input spectra with a set of smoothed spectral-shape templates that are learned through ordinary exposure to speech. In the present simulation of this process, the input spectra are computed over a sufficiently long window to resolve individual harmonics of voiced speech. Prior to template creation and pattern matching, the narrow band spectra are amplitude equalized by a spectrum-level normalization process, and the information-bearing spectral peaks are enhanced by a "flooring" procedure that zeroes out spectral values below a threshold function consisting of a center-weighted running average of spectral amplitudes. Templates for each vowel category are created simply by averaging the narrow band spectra of like vowels spoken by a panel of talkers. In the present implementation, separate templates are used for men, women, and children. The pattern matching is implemented with a simple city-block distance measure given by the sum of the channel-by-channel differences between the narrow band input spectrum (level-equalized and floored) and each vowel template. Spectral movement is taken into account by computing the distance measure at several points throughout the course of the vowel. The input spectrum is assigned to the vowel template that results in the smallest difference accumulated over the sequence of spectral slices. The model was evaluated using a large database consisting of 12 vowels in /hVd/ context spoken by 45 men, 48 women, and 46 children. The narrow band model classified vowels in this database with a degree of accuracy (91.4%) approaching that of human listeners.
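The pattern-matching step described above can be sketched as a city-block distance accumulated over spectral slices. The sketch assumes the narrow band spectra have already been level-equalized and floored as described; the classification itself is just a nearest-template search.

```python
import numpy as np

def classify_vowel(input_slices, templates):
    """Sketch of the narrow band pattern-matching step: city-block
    (sum of absolute channel differences) distance between the input
    and each vowel template, accumulated over spectral slices.

    input_slices: (n_slices, n_channels) level-equalized, floored
    narrow band spectra sampled through the course of one vowel.
    templates: dict mapping vowel label -> (n_slices, n_channels)
    template (averaged like-vowel spectra from a panel of talkers).
    """
    best, best_dist = None, np.inf
    for label, tmpl in templates.items():
        dist = np.abs(input_slices - tmpl).sum()
        if dist < best_dist:
            best, best_dist = label, dist
    return best
```

In the model, separate template sets for men, women, and children would be held in separate dicts, and the input assigned to the vowel with the smallest accumulated distance.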
Affiliation(s)
- James M Hillenbrand
- Department of Speech Pathology and Audiology, Western Michigan University, Kalamazoo, Michigan 49008, USA.
65. Dorman MF, Loizou PC, Spahr AJ, Maloff E. Factors that allow a high level of speech understanding by patients fit with cochlear implants. Am J Audiol 2002; 11:119-123. PMID: 12691222. DOI: 10.1044/1059-0889(2002/014).
Abstract
Three factors account for the high level of speech understanding in quiet enjoyed by many patients fit with cochlear implants. First, some information about speech exists in the time/amplitude envelope of speech. This information is sufficient to narrow the number of word candidates for a given signal. Second, if information from the envelope of speech is available to listeners, then only minimal information from the frequency domain is necessary for high levels of speech recognition in quiet. Third, perceiving strategies for speech are inherently flexible in terms of the mapping between signal frequencies (i.e., the locations of the formants) and phonetic identity.
Affiliation(s)
- Michael F Dorman
- Department of Speech and Hearing Science, Arizona State University, Tempe 85287-0102, USA.
66.
Abstract
The first two formant frequencies (F1 and F2) are the primary cues for vowel identification. In the categorization of naturally spoken vowels, however, the vowels overlap in the F1-F2 plane. The fundamental frequency (F0), the third formant frequency (F3), and the spectral envelope have been proposed as additional cues. In the present study, to investigate the spectral regions essential for vowel identification, untrained subjects performed a forced-choice identification task in response to isolated Japanese vowels (/a, o, u, e, i/) in which some spectral regions were deleted. The minimum spectral regions needed for correct vowel identification were the two regions including F1 and F2 (the first and fourth of the quadrisected F1-F2 regions on the Bark scale). This was true even when phonetically different vowels had similar combinations of F1 and F2 frequency components. The F0 and F3 cues were not necessarily needed. It is concluded that the spectral regions are not of equal importance; rather, the weight falls on the two critical spectral regions. The auditory system may identify vowels by analyzing the spectral shapes and the formant frequencies (F1 and F2) in these critical spectral regions.
Affiliation(s)
- Shuichi Sakayori
- Department of Physiology, Yamanashi Medical University, Tamaho, Japan.
67. Slud E, Stone M, Smith PJ, Goldstein M. Principal components representation of the two-dimensional coronal tongue surface. Phonetica 2002; 59:108-133. PMID: 12232463. DOI: 10.1159/000066066.
Abstract
This paper uses principal components (PC) analysis to represent coronal tongue contours for the 11 vowels of English in two consonant contexts (/s/, /l/), based upon five replicated measurements in three sessions for each of 6 subjects. Curves from multiple sessions and speakers were overlaid before analysis onto a common (x, y) coordinate system by extensive preprocessing of the curves including: extension (padding) or truncation within session, translation, and truncation to a common x range. Four PCs plus a mean level allow accurate representation of coronal tongue curves, but PC shapes depend strongly on the degree of padding or truncation. The PCs successfully reduced the dimensionality of the curves and reflected vowel height, consonant context, and physiological features.
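The PC representation of contours can be sketched as follows: once each curve has been aligned to the common coordinate system (the paper's padding/truncation preprocessing, assumed already done here), each contour becomes a vector of y-values on a shared x grid, and the components come from an SVD of the centered data matrix.

```python
import numpy as np

def tongue_pca(contours, n_pc=4):
    """Sketch of a PC decomposition of aligned coronal contours.

    contours: (n_curves, n_points) array of y-values sampled on a
    common x grid (alignment preprocessing assumed done).
    Returns the mean contour, the first n_pc principal components,
    and the per-contour scores.
    """
    X = np.asarray(contours, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the PCs
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    pcs = Vt[:n_pc]
    scores = (X - mean) @ pcs.T
    return mean, pcs, scores
```

A contour is then approximated as mean + scores @ pcs; with four components plus the mean the paper reports accurate reconstruction of coronal tongue curves.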
Affiliation(s)
- Eric Slud
- Mathematics Department, University of Maryland, College Park, Md 20742, USA.
68. Ménard L, Schwartz JL, Boë LJ, Kandel S, Vallée N. Auditory normalization of French vowels synthesized by an articulatory model simulating growth from birth to adulthood. J Acoust Soc Am 2002; 111:1892-1905. PMID: 12002872. DOI: 10.1121/1.1459467.
Abstract
The present article aims at exploring the invariant parameters involved in the perceptual normalization of French vowels. A set of 490 stimuli, including the ten French vowels /i y u e ø o ɛ œ ɔ a/ produced by an articulatory model simulating seven growth stages and seven fundamental frequency values, was presented to 43 subjects in a perceptual identification test. The results confirm the important effect of the tonality distance between F1 and f0 on perceived height. It does not seem, however, that height perception involves a binary organization determined by the 3-3.5-Bark critical distance. Regarding place of articulation, the tonotopic distance between F1 and F2 appears to be the best predictor of the perceived front-back dimension. Nevertheless, the role of the difference between F2 and F3 remains important. Roundedness is also examined and correlated with the effective second formant, which involves spectral integration of higher formants within the 3.5-Bark critical distance. The results shed light on the issue of perceptual invariance and can be interpreted as perceptual constraints imposed on speech production.
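The Bark-scale distances discussed above can be made concrete with a Hz-to-Bark conversion. Traunmüller's (1990) formula is used here as one common choice; the paper does not necessarily use this exact formula.

```python
def hz_to_bark(f):
    """Hz -> Bark via Traunmueller's formula, z = 26.81 f / (1960 + f) - 0.53
    (one common conversion; an assumption, not necessarily the paper's)."""
    return 26.81 * f / (1960.0 + f) - 0.53

def f1_f0_bark_distance(f1, f0):
    """Tonality distance between F1 and f0 in Bark, the cue the study
    finds central to perceived vowel height (compared there against a
    3-3.5 Bark critical distance)."""
    return hz_to_bark(f1) - hz_to_bark(f0)
```

For example, an open vowel with F1 near 700 Hz spoken at f0 = 120 Hz gives an F1-f0 distance well above the 3.5-Bark critical distance.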
Affiliation(s)
- Lucie Ménard
- ICP-INPG, UMR CNRS No. 5009, Université Stendhal, Grenoble, France.
69
Perkell J, Numa W, Vick J, Lane H, Balkany T, Gould J. Language-specific, hearing-related changes in vowel spaces: a preliminary study of English- and Spanish-speaking cochlear implant users. Ear Hear 2001; 22:461-70. [PMID: 11770669] [DOI: 10.1097/00003446-200112000-00003]
Abstract
OBJECTIVE This study investigates the role of hearing in vowel productions of postlingually deafened cochlear implant users. Two hypotheses are tested that derive from the view that vowel production is influenced by competing demands of intelligibility for the listener and least effort in the speaker: 1) Hearing enables a cochlear implant user to produce vowels distinctly from one another; without hearing, the speaker may give more weight to economy of effort, leading to reduced vowel separation. 2) Speakers may need to produce vowels more distinctly from one another in a language with a relatively "crowded" vowel space, such as American English, than in a language with relatively few vowels, such as Spanish. Thus, when switching between hearing and non-hearing states, English speakers may show a tradeoff between vowel distinctiveness and least effort, whereas Spanish speakers may not. DESIGN To test the prediction that there will be a reduction of average vowel spacing (AVS) (average intervowel distance in the F1-F2 plane) with interrupted hearing for English-speaking cochlear implant users, but no systematic change in AVS for Spanish cochlear implant users, vowel productions of seven English-speaking and seven Spanish-speaking cochlear implant users, who had been using their implants for at least 1 yr, were recorded when their implant speech processors were turned off and on several times in two sessions. RESULTS AVS was consistently larger for the English speakers with hearing than without hearing. The magnitude and direction of AVS change was more variable for the Spanish speakers, both within and between subjects. CONCLUSION Vowel distinctiveness was enhanced with the provision of some hearing in the language group with a more crowded vowel space but not in the language group with fewer vowels. The view that speakers seek to minimize effort while maintaining the distinctiveness of acoustic goals receives some support.
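The AVS measure defined in the abstract above reduces to a mean pairwise Euclidean distance between vowels in the F1-F2 plane. A minimal sketch (the formant values below are hypothetical illustrations, not the study's data):

```python
import itertools
import math

def average_vowel_spacing(vowels):
    """Mean Euclidean distance between all vowel pairs in the F1-F2 plane (Hz)."""
    pairs = list(itertools.combinations(vowels.values(), 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical (F1, F2) means in Hz for three vowels
formants = {"i": (270, 2290), "a": (730, 1090), "u": (300, 870)}
avs = average_vowel_spacing(formants)
```

A shrinking AVS under the implant-off condition would then correspond directly to the reduced vowel separation the authors predict for English speakers.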
Affiliation(s)
- J Perkell
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge 02139, USA
70
Ito M, Tsuchida J, Yano M. On the effectiveness of whole spectral shape for vowel perception. J Acoust Soc Am 2001; 110:1141-1149. [PMID: 11519581] [DOI: 10.1121/1.1384908]
Abstract
The formant hypothesis of vowel perception, where the lowest two or three formant frequencies are essential cues for vowel quality perception, is widely accepted. There has, however, been some controversy suggesting that formant frequencies are not sufficient and that the whole spectral shape is necessary for perception. Three psychophysical experiments were performed to study this question. In the first experiment, the first or second formant peak of stimuli was suppressed as much as possible while still maintaining the original spectral shape. The responses to these stimuli were not radically different from the ones for the unsuppressed control. In the second experiment, F2-suppressed stimuli, whose amplitude ratios of high- to low-frequency components were systematically changed, were used. The results indicate that the ratio changes can affect perceived vowel quality, especially its place of articulation. In the third experiment, the full-formant stimuli, whose amplitude ratios were changed from the original and whose F2's were kept constant, were used. The results suggest that the amplitude ratio is equal to or more effective than F2 as a cue for place of articulation. We conclude that formant frequencies are not exclusive cues and that the whole spectral shape can be crucial for vowel perception.
Affiliation(s)
- M Ito
- Wako Research Center, Honda R&D Co, Ltd, Saitama, Japan.
71
Leek MR, Summers V. Pitch strength and pitch dominance of iterated rippled noises in hearing-impaired listeners. J Acoust Soc Am 2001; 109:2944-2954. [PMID: 11425136] [DOI: 10.1121/1.1371761]
Abstract
Reports using a variety of psychophysical tasks indicate that pitch perception by hearing-impaired listeners may be abnormal, contributing to difficulties in understanding speech and enjoying music. Pitches of complex sounds may be weaker and more indistinct in the presence of cochlear damage, especially when frequency regions are affected that form the strongest basis for pitch perception in normal-hearing listeners. In this study, the strength of the complex pitch generated by iterated rippled noise was assessed in normal-hearing and hearing-impaired listeners. Pitch strength was measured for broadband noises with spectral ripples generated by iteratively delaying a copy of a given noise and adding it back into the original. Octave-band-pass versions of these noises also were evaluated to assess frequency dominance regions for rippled-noise pitch. Hearing-impaired listeners demonstrated consistently weaker pitches in response to the rippled noises relative to pitch strength in normal-hearing listeners. However, in most cases, the frequency regions of pitch dominance, i.e., strongest pitch, were similar to those observed in normal-hearing listeners. Except where there exists a substantial sensitivity loss, contributions from normal pitch dominance regions associated with the strongest pitches may not be directly related to impaired spectral processing. It is suggested that the reduced strength of rippled-noise pitch in listeners with hearing loss results from impaired frequency resolution and possibly an associated deficit in temporal processing.
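The delay-and-add generation of iterated rippled noise described above can be sketched as follows. This is an "add-same" network with unit gain; the delay, iteration count, and signal length are illustrative, not the study's parameters:

```python
import random

def iterated_rippled_noise(n_samples, delay, iterations, gain=1.0, seed=0):
    """Generate IRN: repeatedly delay the running signal by `delay` samples,
    scale it by `gain`, and add it back to itself ("add-same" network).
    The result has a weak pitch at fs / delay."""
    rng = random.Random(seed)
    sig = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    for _ in range(iterations):
        delayed = [0.0] * delay + sig[:-delay]
        sig = [s + gain * d for s, d in zip(sig, delayed)]
    return sig

# 100 samples, 10-sample delay (pitch near fs/10), 4 iterations
noise = iterated_rippled_noise(100, delay=10, iterations=4)
```

More iterations deepen the spectral ripple and strengthen the pitch, which is what makes iteration count a convenient knob for pitch-strength experiments like the one above.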
Affiliation(s)
- M R Leek
- Army Audiology and Speech Center, Walter Reed Army Medical Center, Washington, DC 20307-5001, USA
72
Hillenbrand JM, Clark MJ, Nearey TM. Effects of consonant environment on vowel formant patterns. J Acoust Soc Am 2001; 109:748-63. [PMID: 11248979] [DOI: 10.1121/1.1337959]
Abstract
A significant body of evidence has accumulated indicating that vowel identification is influenced by spectral change patterns. For example, a large-scale study of vowel formant patterns showed substantial improvements in category separability when a pattern classifier was trained on multiple samples of the formant pattern rather than a single sample at steady state [J. Hillenbrand et al., J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. However, in the earlier study all utterances were recorded in a constant /hVd/ environment. The main purpose of the present study was to determine whether a close relationship between vowel identity and spectral change patterns is maintained when the consonant environment is allowed to vary. Recordings were made of six men and six women producing eight vowels (see text) in isolation and in CVC syllables. The CVC utterances consisted of all combinations of seven initial consonants (/h,b,d,g,p,t,k/) and six final consonants (/b,d,g,p,t,k/). Formant frequencies for F1-F3 were measured every 5 ms during the vowel using an interactive editing tool. Results showed highly significant effects of phonetic environment. As with an earlier study of this type, particularly large shifts in formant patterns were seen for rounded vowels in alveolar environments [K. Stevens and A. House, J. Speech Hear. Res. 6, 111-128 (1963)]. Despite these context effects, substantial improvements in category separability were observed when a pattern classifier incorporated spectral change information. Modeling work showed that many aspects of listener behavior could be accounted for by a fairly simple pattern classifier incorporating F0, duration, and two discrete samples of the formant pattern.
Affiliation(s)
- J M Hillenbrand
- Speech Pathology and Audiology, Western Michigan University, Kalamazoo 49008, USA.
73
Callan DE, Kent RD, Guenther FH, Vorperian HK. An auditory-feedback-based neural network model of speech production that is robust to developmental changes in the size and shape of the articulatory system. J Speech Lang Hear Res 2000; 43:721-736. [PMID: 10877441] [DOI: 10.1044/jslhr.4303.721]
Abstract
The purpose of this article is to demonstrate that self-produced auditory feedback is sufficient to train a mapping between auditory target space and articulator space under conditions in which the structures of speech production are undergoing considerable developmental restructuring. One challenge for competing theories that propose invariant constriction targets is that it is unclear what teaching signal could specify constriction location and degree so that a mapping between constriction target space and articulator space can be learned. It is predicted that a model trained by auditory feedback will accomplish speech goals, in auditory target space, by continuously learning to use different articulator configurations to adapt to the changing acoustic properties of the vocal tract during development. The Maeda articulatory synthesis part of the DIVA neural network model (Guenther et al., 1998) was modified to reflect the development of the vocal tract by using measurements taken from MR images of children. After training, the model was able to maintain the 11 English vowel targets in auditory planning space, utilizing varying articulator configurations, despite morphological changes that occur during development. The vocal-tract constriction pattern (derived from the vocal-tract area function) as well as the formant values varied during the course of development in correspondence with morphological changes in the structures involved with speech production. Despite changes in the acoustical properties of the vocal tract that occur during the course of development, the model was able to demonstrate motor-equivalent speech production under lip-restriction conditions. The model accomplished this in a self-organizing manner even though there was no prior experience with lip restriction during training.
Affiliation(s)
- D E Callan
- ATR Human Information Processing Research Laboratories, Kyoto, Japan.
74
Kewley-Port D, Zheng Y. Vowel formant discrimination: towards more ordinary listening conditions. J Acoust Soc Am 1999; 106:2945-2958. [PMID: 10573907] [DOI: 10.1121/1.428134]
Abstract
Thresholds for formant frequency discrimination have been established using optimal listening conditions. In normal conversation, the ability to discriminate formant frequency is probably substantially degraded. The purpose of the present study was to change the listening procedures in several substantial ways from optimal towards more ordinary listening conditions, including a higher level of stimulus uncertainty, increased levels of phonetic context, and the addition of a sentence identification task. Four vowels synthesized from a female talker were presented in isolation, or in the phonetic context of /bVd/ syllables, three-word phrases, or nine-word sentences. In the first experiment, formant resolution was estimated under medium stimulus uncertainty for three levels of phonetic context. Some undesirable training effects were obtained and led to the design of a new protocol for the second experiment to reduce this problem and to manipulate both length of phonetic context and level of difficulty in the simultaneous sentence identification task. Similar results were obtained in both experiments. The effect of phonetic context on formant discrimination is reduced as context lengthens such that no difference was found between vowels embedded in the phrase or sentence contexts. The addition of a challenging sentence identification task to the discrimination task did not degrade performance further and a stable pattern for formant discrimination in sentences emerged. This norm for the resolution of vowel formants under these more ordinary listening conditions was shown to be nearly a constant at 0.28 barks. Analysis of vowel spaces from 16 American English talkers determined that the closest vowels, on average, were 0.56 barks apart, that is, a factor of 2 larger than the norm obtained in these vowel formant discrimination tasks.
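Bark-scale distances like the 0.28- and 0.56-bark figures above come from a Hz-to-bark conversion. The sketch below uses Traunmüller's (1990) approximation, one of several formulas in common use; the abstract does not specify which one the study employed, and the 2000 Hz example is illustrative:

```python
def hz_to_bark(f_hz):
    """Traunmüller (1990) approximation of the bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_distance(fa_hz, fb_hz):
    """Distance between two frequencies on the bark scale."""
    return abs(hz_to_bark(fa_hz) - hz_to_bark(fb_hz))

# A shift of roughly 0.28 bark near F2 = 2000 Hz, e.g. 2000 -> 2085 Hz
d = bark_distance(2000.0, 2085.0)
```

Because the bark scale compresses high frequencies, a constant bark threshold corresponds to a progressively larger Hz difference as formant frequency increases.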
Affiliation(s)
- D Kewley-Port
- Department of Speech and Hearing Sciences, Indiana University, Bloomington 47405, USA.
75
Bachorowski JA, Owren MJ. Acoustic correlates of talker sex and individual talker identity are present in a short vowel segment produced in running speech. J Acoust Soc Am 1999; 106:1054-1063. [PMID: 10462810] [DOI: 10.1121/1.427115]
Abstract
Although listeners routinely perceive both the sex and individual identity of talkers from their speech, explanations of these abilities are incomplete. Here, variation in vocal production-related anatomy was assumed to affect vowel acoustics thought to be critical for indexical cueing. Integrating this approach with source-filter theory, patterns of acoustic parameters that should represent sex and identity were identified. Due to sexual dimorphism, the combination of fundamental frequency (F0, reflecting larynx size) and vocal tract length cues (VTL, reflecting body size) was predicted to provide the strongest acoustic correlates of talker sex. Acoustic measures associated with presumed variations in supralaryngeal vocal tract-related anatomy occurring within sex were expected to be prominent in individual talker identity. These predictions were supported by results of analyses of 2500 tokens of the /ɛ/ phoneme, extracted from the naturally produced speech of 125 subjects. Classification by talker sex was virtually perfect when F0 and VTL were used together, whereas talker classification depended primarily on the various acoustic parameters associated with vocal-tract filtering.
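Vocal tract length cues of the kind invoked above are commonly derived from formants by modeling the tract as a uniform tube closed at one end, whose resonances fall at Fk = (2k-1)c/4L. The sketch below is that textbook estimate, not necessarily the specific VTL measure the study used, and the formant values are hypothetical:

```python
def vtl_from_formants(formants_hz, c_cm_per_s=35000.0):
    """Estimate vocal-tract length (cm) from measured formants, assuming a
    uniform tube closed at the glottis: Fk = (2k - 1) * c / (4 * L).
    Averages the per-formant length estimates."""
    ests = [(2 * k - 1) * c_cm_per_s / (4.0 * f)
            for k, f in enumerate(formants_hz, start=1)]
    return sum(ests) / len(ests)

# Hypothetical schwa-like formants (Hz) for an adult male talker
length_cm = vtl_from_formants([500.0, 1500.0, 2500.0])
```

Uniformly higher formants yield a shorter estimated tract, which is why VTL-type measures separate men, women, and children so effectively.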
Affiliation(s)
- J A Bachorowski
- Department of Psychology, Vanderbilt University, Nashville, Tennessee 37240, USA.
76
Jenkins JJ, Strange W, Trent SA. Context-independent dynamic information for the perception of coarticulated vowels. J Acoust Soc Am 1999; 106:438-448. [PMID: 10420634] [DOI: 10.1121/1.427067]
Abstract
Most investigators agree that the acoustic information for American English vowels includes dynamic (time-varying) parameters as well as static "target" information contained in a single cross section of the syllable. Using the silent-center (SC) paradigm, the present experiment examined the case in which the initial and final portions of stop consonant-vowel-stop consonant (CVC) syllables containing the same vowel but different consonants were recombined into mixed-consonant SC syllables and presented to listeners for vowel identification. Ten vowels were spoken in six different syllables, /bVb, bVd, bVt, dVb, dVd, dVt/, embedded in a carrier sentence. Initial and final transitional portions of these syllables were cross-matched in: (1) silent-center syllables with original syllable durations (silences) preserved (mixed-consonant SC condition) and (2) mixed-consonant SC syllables with syllable duration equated across the ten vowels (fixed duration mixed-consonant SC condition). Vowel-identification accuracy in these two mixed-consonant SC conditions was compared with performance on the original SC and fixed duration SC stimuli, and in initial and final control conditions in which initial and final transitional portions were each presented alone. Vowels were identified highly accurately in both mixed-consonant SC and original syllable SC conditions (only 7%-8% overall errors). Neutralizing duration information led to small, but significant, increases in identification errors in both mixed-consonant and original fixed-duration SC conditions (14%-15% errors), but performance was still much more accurate than for initial and final control conditions (35% and 52% errors, respectively). Acoustical analysis confirmed that direction and extent of formant change from initial to final portions of mixed-consonant stimuli differed from that of original syllables, arguing against a target + offglide explanation of the perceptual results. Results do support the hypothesis that temporal trajectories specifying "style of movement" provide information for the differentiation of American English tense and lax vowels, and that this information is invariant over the place of articulation and voicing of the surrounding stop consonants.
Affiliation(s)
- J J Jenkins
- Department of Psychology, University of South Florida, Tampa 33620, USA.
77
de Cheveigné A, Kawahara H. Missing-data model of vowel identification. J Acoust Soc Am 1999; 105:3497-3508. [PMID: 10380672] [DOI: 10.1121/1.424675]
Abstract
Vowel identity correlates well with the shape of the transfer function of the vocal tract, in particular the position of the first two or three formant peaks. However, in voiced speech the transfer function is sampled at multiples of the fundamental frequency (F0), and the short-term spectrum contains peaks at those frequencies, rather than at formants. It is not clear how the auditory system estimates the original spectral envelope from the vowel waveform. Cochlear excitation patterns, for example, resolve harmonics in the low-frequency region and their shape varies strongly with F0. The problem cannot be cured by smoothing: lag-domain components of the spectral envelope are aliased and cause F0-dependent distortion. The problem is severe at high F0's where the spectral envelope is severely undersampled. This paper treats vowel identification as a process of pattern recognition with missing data. Matching is restricted to available data, and missing data are ignored using an F0-dependent weighting function that emphasizes regions near harmonics. The model is presented in two versions: a frequency-domain version based on short-term spectra, or tonotopic excitation patterns, and a time-domain version based on autocorrelation functions. It accounts for the relative F0-independency observed in vowel identification.
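The F0-dependent weighting idea above can be illustrated with a toy frequency-domain matcher. The raised-cosine weight below is an illustrative choice, not the paper's exact function, and the spectra are synthetic:

```python
import math

def harmonic_weight(f_hz, f0_hz):
    """Weight near 1 at multiples of f0 and near 0 midway between harmonics
    (an illustrative raised-cosine, standing in for the paper's weighting)."""
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * f_hz / f0_hz))

def weighted_distance(freqs, observed, template, f0):
    """Missing-data style match: mismatch far from harmonics, where the
    envelope is not sampled by voiced excitation, is down-weighted."""
    w = [harmonic_weight(f, f0) for f in freqs]
    num = sum(wi * (o - t) ** 2 for wi, o, t in zip(w, observed, template))
    return num / sum(w)

freqs = list(range(100, 4000, 50))
flat = [1.0] * len(freqs)
# Mismatch placed exactly at harmonics of 200 Hz vs. exactly between them
bump_at_harmonics = [1.0 + (1.0 if f % 200 == 0 else 0.0) for f in freqs]
bump_between = [1.0 + (1.0 if f % 200 == 100 else 0.0) for f in freqs]
d_h = weighted_distance(freqs, bump_at_harmonics, flat, f0=200)
d_b = weighted_distance(freqs, bump_between, flat, f0=200)
```

Here `d_b` is essentially zero while `d_h` is not: deviations between harmonics are treated as missing data, which is the mechanism the model uses to stay F0-independent.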
Affiliation(s)
- A de Cheveigné
- Laboratoire de Linguistique Formelle, CNRS/Université Paris 7, France.
78
Hillenbrand JM, Nearey TM. Identification of resynthesized /hVd/ utterances: effects of formant contour. J Acoust Soc Am 1999; 105:3509-23. [PMID: 10380673] [DOI: 10.1121/1.424676]
Abstract
The purpose of this study was to examine the role of formant frequency movements in vowel recognition. Measurements of vowel duration, fundamental frequency, and formant contours were taken from a database of acoustic measurements of 1668 /hVd/ utterances spoken by 45 men, 48 women, and 46 children [Hillenbrand et al., J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. A 300-utterance subset was selected from this database, representing equal numbers of 12 vowels and approximately equal numbers of tokens produced by men, women, and children. Listeners were asked to identify the original, naturally produced signals and two formant-synthesized versions. One set of "original formant" (OF) synthetic signals was generated using the measured formant contours, and a second set of "flat formant" (FF) signals was synthesized with formant frequencies fixed at the values measured at the steadiest portion of the vowel. Results included: (a) the OF synthetic signals were identified with substantially greater accuracy than the FF signals; and (b) the naturally produced signals were identified with greater accuracy than the OF synthetic signals. Pattern recognition results showed that a simple approach to vowel specification based on duration, steady-state F0, and formant frequency measurements at 20% and 80% of vowel duration accounts for much but by no means all of the variation in listeners' labeling of the three types of stimuli.
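The vowel representation described at the end of the abstract (duration, steady-state F0, and formant samples at 20% and 80% of vowel duration) lends itself to a simple feature extractor. The sketch below assumes, purely for illustration, that a formant track is stored as a list of per-frame (F1, F2, F3) triples:

```python
def vowel_features(duration_ms, f0_hz, formant_track):
    """Build a feature vector of the kind used in the pattern-recognition
    analysis: duration, steady-state F0, and F1-F3 sampled at 20% and 80%
    of the vowel's duration. `formant_track` is a list of (F1, F2, F3)
    frames; this storage format is an assumption, not the study's."""
    n = len(formant_track)
    early = formant_track[round(0.2 * (n - 1))]
    late = formant_track[round(0.8 * (n - 1))]
    return [duration_ms, f0_hz, *early, *late]

# Hypothetical 11-frame track for a 250-ms vowel with F0 = 120 Hz
track = [(300 + 10 * i, 2200 - 20 * i, 2900) for i in range(11)]
features = vowel_features(250, 120, track)
```

Two temporal samples per formant are what let such a classifier capture the spectral-change ("vowel-inherent spectral change") information that the flat-formant stimuli lacked.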
Affiliation(s)
- J M Hillenbrand
- Department of Speech Pathology and Audiology, Western Michigan University, Kalamazoo 49008, USA
79
Paavilainen P, Jaramillo M, Näätänen R, Winkler I. Neuronal populations in the human brain extracting invariant relationships from acoustic variance. Neurosci Lett 1999; 265:179-82. [PMID: 10327160] [DOI: 10.1016/s0304-3940(99)00237-2]
Abstract
The ability to extract invariant relationships from physically varying stimulation is critical for example to categorical perception of complex auditory information such as speech and music. Human subjects were presented with tone pairs randomly varying over a wide frequency range, there being no physically constant tone pair at all. Instead, the invariant feature was either the direction of the tone pairs (ascending: the second tone was higher in frequency than the first tone) or the frequency ratio (musical interval) of the two tones. The subjects ignored the tone pairs, and instead attended a silent video. Occasional deviant pairs (either descending in direction or having a different frequency ratio) elicited the mismatch negativity (MMN) of the event-related potential, demonstrating the existence of neuronal populations which automatically (independently of attention) extract invariant relationships from acoustical variance.
Affiliation(s)
- P Paavilainen
- Department of Psychology, University of Helsinki, Finland.
80
Shannon RV, Zeng FG, Wygonski J. Speech recognition with altered spectral distribution of envelope cues. J Acoust Soc Am 1998; 104:2467-76. [PMID: 10491708] [DOI: 10.1121/1.423774]
Abstract
Recognition of consonants, vowels, and sentences was measured in conditions of reduced spectral resolution and distorted spectral distribution of temporal envelope cues. Speech materials were processed through four bandpass filters (analysis bands), half-wave rectified, and low-pass filtered to extract the temporal envelope from each band. The envelope from each speech band modulated a band-limited noise (carrier bands). Analysis and carrier bands were manipulated independently to alter the spectral distribution of envelope cues. Experiment I demonstrated that the location of the cutoff frequencies defining the bands was not a critical parameter for speech recognition, as long as the analysis and carrier bands were matched in frequency extent. Experiment II demonstrated a dramatic decrease in performance when the analysis and carrier bands did not match in frequency extent, which resulted in a warping of the spectral distribution of envelope cues. Experiment III demonstrated a large decrease in performance when the carrier bands were shifted in frequency, mimicking the basal position of electrodes in a cochlear implant. And experiment IV showed a relatively minor effect of the overlap in the noise carrier bands, simulating the overlap in neural populations responding to adjacent electrodes in a cochlear implant. Overall, these results show that, for four bands, the frequency alignment of the analysis bands and carrier bands is critical for good performance, while the exact frequency divisions and overlap in carrier bands are not as critical.
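One channel of the envelope-extraction chain described above (half-wave rectification, then low-pass filtering, with the result modulating a noise carrier) can be sketched as follows. The band-pass analysis and carrier-band filtering stages are omitted, the first-order filter is a crude stand-in for the paper's filters, and all parameters are illustrative:

```python
import math
import random

def half_wave_rectify(x):
    """Keep positive half-cycles, zero the rest."""
    return [max(s, 0.0) for s in x]

def one_pole_lowpass(x, cutoff_hz, fs):
    """First-order IIR low-pass filter (a crude smoother, not the paper's)."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    y, state = [], 0.0
    for s in x:
        state = a * state + (1.0 - a) * s
        y.append(state)
    return y

def envelope_modulated_noise(band, cutoff_hz, fs, seed=0):
    """One vocoder channel: extract the temporal envelope of an analysis
    band and use it to amplitude-modulate a noise carrier."""
    env = one_pole_lowpass(half_wave_rectify(band), cutoff_hz, fs)
    rng = random.Random(seed)
    return [e * rng.uniform(-1.0, 1.0) for e in env]

fs = 8000
tone = [math.sin(2 * math.pi * 500 * n / fs) for n in range(800)]  # stand-in band
channel = envelope_modulated_noise(tone, cutoff_hz=160, fs=fs)
```

In the full processor, four such channels (with band-limited rather than broadband carriers) are summed, and the experiments above amount to deliberately mismatching the analysis and carrier bands of each channel.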
Affiliation(s)
- R V Shannon
- House Ear Institute, Los Angeles, California 90057, USA.
81
Hashi M, Westbury JR, Honda K. Vowel posture normalization. J Acoust Soc Am 1998; 104:2426-2437. [PMID: 10491704] [DOI: 10.1121/1.423750]
Abstract
A simple normalization procedure was applied to point-parametrized articulatory data to yield quantitative speaker-general descriptions of "average" vowel postures. Articulatory data from 20 English and 8 Japanese speakers, drawn from existing x-ray microbeam database corpora, were included in the analysis. The purpose of the normalization procedure was to minimize the effects of differences in vocal tract size and shape on average postures derived from the raw data. The procedure resulted in a general reduction of cross-speaker variance in the y dimension of the normalized space, within both language groups. This result can be traced to a systematic source of variance in the y dimension of the raw data (i.e., palatal height) "successfully removed" from the normalized data. The procedure did not result in a comparable, general reduction in cross-speaker variance in the x dimension. This negative result can be traced partly to the new observation that some speakers within the English sample habitually placed their tongues in a fronted position for all vowels, whereas other speakers habitually placed their tongues in a rearward position. Methods for evaluating articulatory normalization schemes, and possible sources of interspeaker variability in vowel postures, are discussed.
Affiliation(s)
- M Hashi
- Waisman Center, University of Wisconsin-Madison 53705-2280, USA
82
Nygaard LC, Pisoni DB. Talker-specific learning in speech perception. Percept Psychophys 1998; 60:355-76. [PMID: 9599989] [DOI: 10.3758/bf03206860]
Abstract
The effects of perceptual learning of talker identity on the recognition of spoken words and sentences were investigated in three experiments. In each experiment, listeners were trained to learn a set of 10 talkers' voices and were then given an intelligibility test to assess the influence of learning the voices on the processing of the linguistic content of speech. In the first experiment, listeners learned voices from isolated words and were then tested with novel isolated words mixed in noise. The results showed that listeners who were given words produced by familiar talkers at test showed better identification performance than did listeners who were given words produced by unfamiliar talkers. In the second experiment, listeners learned novel voices from sentence-length utterances and were then presented with isolated words. The results showed that learning a talker's voice from sentences did not generalize well to identification of novel isolated words. In the third experiment, listeners learned voices from sentence-length utterances and were then given sentence-length utterances produced by familiar and unfamiliar talkers at test. We found that perceptual learning of novel voices from sentence-length utterances improved speech intelligibility for words in sentences. Generalization and transfer from voice learning to linguistic processing was found to be sensitive to the talker-specific information available during learning and test. These findings demonstrate that increased sensitivity to talker-specific information affects the perception of the linguistic properties of speech in isolated words and sentences.
Affiliation(s)
- L C Nygaard
- Department of Psychology, Emory University, Atlanta, GA 30322, USA.
83
Lively SE, Pisoni DB. On prototypes and phonetic categories: a critical assessment of the perceptual magnet effect in speech perception. J Exp Psychol Hum Percept Perform 1998. [PMID: 9425674] [DOI: 10.1037//0096-1523.23.6.1665]
Abstract
According to P. K. Kuhl (1991), a perceptual magnet effect occurs when discrimination accuracy is lower among better instances of a phonetic category than among poorer instances. Three experiments examined the perceptual magnet effect for the vowel /i/. In Experiment 1, participants rated some examples of /i/ as better instances of the category than others. In Experiment 2, no perceptual magnet effect was observed with materials based on Kuhl's tokens of /i/ or with items normed for each participant. In Experiment 3, participants labeled the vowels developed from Kuhl's test set. Many of the vowels in the nonprototype /i/ condition were not categorized as /i/s. This finding suggests that the comparisons obtained in Kuhl's original study spanned different phonetic categories.
Affiliation(s)
- S E Lively
- Ameritech, Hoffman Estates, Illinois 60196, USA.
84
Loizou PC, Dorman MF, Powell V. The recognition of vowels produced by men, women, boys, and girls by cochlear implant patients using a six-channel CIS processor. J Acoust Soc Am 1998; 103:1141-1149. [PMID: 9479767] [DOI: 10.1121/1.421248]
Abstract
Five patients who used a six-channel, continuous interleaved sampling (CIS) cochlear implant were presented vowels, in two experiments, from a large sample of men, women, boys, and girls for identification. At issue in the first experiment was whether vowels from one speaker group, i.e., men, were more identifiable than vowels from other speaker groups. At issue in the second experiment was the role of the fifth and sixth channels in the identification of vowels from the different speaker groups. It was found in experiment 1 that (i) the vowels produced by men were easier to identify than vowels produced by any of the other speaker groups, (ii) vowels from women and boys were more difficult to identify than vowels from men but less difficult than vowels from girls, and (iii) vowels from girls were more difficult to identify than vowels from all other groups. In experiment 2 removal of channels 5 and 6 from the processor impaired the identification of vowels produced by women, boys and girls but did not impair the identification of vowels produced by men. The results of experiment 1 demonstrate that scores on tests of vowels produced by men overestimate the ability of patients to recognize vowels in the broader context of multi-talker communication. The results of experiment 2 demonstrate that channels 5 and 6 become more important for vowel recognition as the second formants of the speakers increase in frequency.
Affiliation(s)
- P C Loizou
- Department of Applied Science, University of Arkansas at Little Rock 72204-1099, USA.
85
Langereis MC, Bosman AJ, van Olphen AF, Smoorenburg GF. Changes in vowel quality in post-lingually deafened cochlear implant users. AUDIOLOGY: OFFICIAL ORGAN OF THE INTERNATIONAL SOCIETY OF AUDIOLOGY 1997; 36:279-97. [PMID: 9305524 DOI: 10.3109/00206099709071980] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.6]
Abstract
The present study addresses the effect of cochlear implantation on vowel production of 20 post-lingually deafened Dutch subjects. All subjects received the Nucleus 22 implant (3 WSP and 17 MSP processors). Speech recordings were made pre-implantation and three and twelve months post-implantation with the implant switched on and off. The first and second formant frequencies were measured for eleven Dutch vowels (monophthongs only) in an h-vowel-t context. Twelve months post-implantation, the results showed an increase in the ranges of the first and second formant frequency covered by the respective vowels when the implant was switched on. The increase in the formant frequency range was most marked for some subjects with a relatively small formant range pre-implantation. Also, at 12 months post-implantation with the implant switched on we found a significant shift of the first and second formant frequency towards the normative values. Moreover, at this time the results showed significantly increased clustering of the respective vowels, suggesting an improvement in the ability to produce phonological contrasts between vowels. Clustering is defined as the ratio of the between-vowel variance of the first and second formant frequency and the within-vowel variance of three tokens of the same vowel.
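The clustering measure defined in the abstract above (the ratio of the between-vowel variance of the first and second formant frequencies to the within-vowel variance across tokens of the same vowel) can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and the formant values in the test data are hypothetical:

```python
import numpy as np

def clustering_ratio(tokens):
    """Between-vowel / within-vowel variance ratio of (F1, F2).

    tokens: dict mapping vowel label -> array of shape (n_tokens, 2),
            each row a measured (F1, F2) pair in Hz for one token.
    Returns the ratio averaged over the two formant dimensions;
    higher values mean tighter, better-separated vowel clusters.
    """
    # Variance of the per-vowel mean formants across vowel categories.
    means = np.array([v.mean(axis=0) for v in tokens.values()])
    between = means.var(axis=0)
    # Mean variance of tokens around their own vowel's mean.
    within = np.mean([v.var(axis=0) for v in tokens.values()], axis=0)
    return float(np.mean(between / within))
```

With three tokens per vowel, as in the study, well-separated vowels with repeatable formants yield a large ratio, and a shrunken vowel space with scattered tokens drives it toward 1.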
Affiliation(s)
- M C Langereis
- Department of Otorhinolaryngology, University Hospital, Utrecht, The Netherlands
86
Ohl FW, Scheich H. Orderly cortical representation of vowels based on formant interaction. Proc Natl Acad Sci U S A 1997; 94:9440-4. [PMID: 9256501 PMCID: PMC23209 DOI: 10.1073/pnas.94.17.9440] [Citation(s) in RCA: 72] [Impact Index Per Article: 2.7]
Abstract
Psychophysical experiments have shown that the discrimination of human vowels chiefly relies on the frequency relationship of the first two peaks F1 and F2 of the vowel's spectral envelope. It has not been possible, however, to relate the two-dimensional (F1, F2)-relationship to the known organization of frequency representation in auditory cortex. We demonstrate that certain spectral integration properties of neurons are topographically organized in primary auditory cortex in such a way that a transformed (F1,F2) relationship sufficient for vowel discrimination is realized.
Affiliation(s)
- F W Ohl
- Federal Institute for Neurobiology, Brenneckestrasse 6, D-39118 Magdeburg, Germany.
87
Cutler A, van Ooijen B, Norris D, Sánchez-Casas R. Speeded detection of vowels: a cross-linguistic study. PERCEPTION & PSYCHOPHYSICS 1996; 58:807-22. [PMID: 8768178 DOI: 10.3758/bf03205485] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.7]
Abstract
In four experiments, listeners' response times to detect vowel targets in spoken input were measured. The first three experiments were conducted in English. In two, one using real words and the other, nonwords, detection accuracy was low, targets in initial syllables were detected more slowly than targets in final syllables, and both response time and missed-response rate were inversely correlated with vowel duration. In a third experiment, the speech context for some subjects included all English vowels, while for others, only five relatively distinct vowels occurred. This manipulation had essentially no effect, and the same response pattern was again observed. A fourth experiment, conducted in Spanish, replicated the results in the first three experiments, except that miss rate was here unrelated to vowel duration. We propose that listeners' responses to vowel targets in naturally spoken input are effectively cautious, reflecting realistic appreciation of vowel variability in natural context.
Affiliation(s)
- A Cutler
- MRC Applied Psychology Unit, Cambridge, England.
88
Hawks JW, Fourakis MS. The perceptual vowel spaces of American English and Modern Greek: a comparison. LANGUAGE AND SPEECH 1995; 38(Pt 3):237-252. [PMID: 8816085 DOI: 10.1177/002383099503800302] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0]
Abstract
Perceptually based vowel spaces are estimated for American English and Modern Greek by means of identifications of synthetic vowel sounds by native speakers of each language. The vowel spaces for American English appear to be organized in a sufficiently contrastive system, while Modern Greek vowels appear to be maximally contrastive. The spaces for the Modern Greek point vowels ([i], [a], [u]) fall within the spaces of their American English counterparts, while the intermediate Modern Greek vowels ([e], [o]) overlap the American English [ɛ]/[e] and [ɔ]/[o] spaces, respectively. These results were relatively unaffected by mapping resolution and level of phonetic training and support the results of similar mappings using production data.
Affiliation(s)
- J W Hawks
- Kent State University, School of Speech Pathology and Audiology, OH 44242-0001, USA.
89
Abstract
To determine how familiarity with a talker's voice affects perception of spoken words, we trained two groups of subjects to recognize a set of voices over a 9-day period. One group then identified novel words produced by the same set of talkers at four signal-to-noise ratios. Control subjects identified the same words produced by a different set of talkers. The results showed that the ability to identify a talker's voice improved intelligibility of novel words produced by that talker. The results suggest that speech perception may involve talker-contingent processes whereby perceptual learning of aspects of the vocal source facilitates the subsequent phonetic analysis of the acoustic signal.
90
Gow DW, Gordon PC. Coming to terms with stress: effects of stress location in sentence processing. JOURNAL OF PSYCHOLINGUISTIC RESEARCH 1993; 22:545-578. [PMID: 8295163 DOI: 10.1007/bf01072936] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.3]
Abstract
The purpose of this research was to determine the role of syllabic stress in language processing during the early on-line processing of speech and later in the representation of a sentence in memory. Experiment 1 used a syllable monitoring task while Experiment 3 used a probe task in which subjects heard a sentence and then were asked to determine whether a probe syllable had occurred in the sentence. In the monitoring task, stressed syllables were detected more rapidly in word-initial position, but unstressed syllables were detected more rapidly in word-final position. Stress facilitation in initial syllables was strongly related to high relative F0, but not to changes in perceived vowel quality as assessed in Experiment 2. This pattern is interpreted as evidence that lexical stress is used on-line to guide lexical access and/or lexical segmentation. The probe task of Experiment 3 showed stress facilitation in both positions, indicating that stress is independently retained in the postperceptual representation of a sentence.
Affiliation(s)
- D W Gow
- Neuropsychology Lab, Massachusetts General Hospital, Boston 02144
91
Hillenbrand J, Gayvert RT. Vowel classification based on fundamental frequency and formant frequencies. JOURNAL OF SPEECH AND HEARING RESEARCH 1993; 36:694-700. [PMID: 8377482 DOI: 10.1044/jshr.3604.694] [Citation(s) in RCA: 22] [Impact Index Per Article: 0.7]
Abstract
A quadratic discriminant classification technique was used to classify spectral measurements from vowels spoken by men, women, and children. The parameters used to train the discriminant classifier consisted of various combinations of fundamental frequency and the three lowest formant frequencies. Several nonlinear auditory transforms were evaluated. Unlike previous studies using a linear discriminant classifier, there was no advantage in category separability for any of the nonlinear auditory transforms over a linear frequency scale, and no advantage for spectral distances over absolute frequencies. However, it was found that parameter sets using nonlinear transforms and spectral differences reduced the differences between phonetically equivalent tokens produced by different groups of talkers.
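A minimal sketch of the quadratic discriminant technique the abstract above describes: fit one Gaussian (mean vector and full covariance) per vowel category over a parameter vector of fundamental frequency and formant frequencies, then classify a token by the highest Gaussian log-likelihood. The function names and the formant values used below are illustrative assumptions, not the study's implementation:

```python
import numpy as np

def fit_qda(X, y):
    """Fit per-class Gaussian parameters for quadratic discriminant classification.

    X: (n_tokens, n_params) array, e.g. columns (F0, F1, F2, F3) in Hz.
    y: (n_tokens,) array of vowel labels.
    Returns {label: (mean, covariance, log prior)}.
    """
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0),
                     np.cov(Xc, rowvar=False),
                     np.log(len(Xc) / len(X)))
    return params

def predict_qda(params, X):
    """Assign each row of X to the class with the highest Gaussian log-likelihood."""
    labels = list(params)
    scores = []
    for c in labels:
        mu, cov, logprior = params[c]
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        d = X - mu
        # Quadratic (Mahalanobis) term gives the method its name:
        # class boundaries are quadratic surfaces in parameter space.
        maha = np.einsum('ij,jk,ik->i', d, inv, d)
        scores.append(-0.5 * (maha + logdet) + logprior)
    return np.array(labels)[np.argmax(np.stack(scores), axis=0)]
```

Because each class keeps its own covariance, the classifier can exploit talker-group differences in formant spread, which is where the choice of frequency scale (linear vs. a nonlinear auditory transform of the parameters) enters.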
Affiliation(s)
- J Hillenbrand
- Department of Speech Pathology and Audiology, Western Michigan University, Kalamazoo 49008
92
Green KP, Kuhl PK, Meltzoff AN, Stevens EB. Integrating speech information across talkers, gender, and sensory modality: female faces and male voices in the McGurk effect. PERCEPTION & PSYCHOPHYSICS 1991; 50:524-36. [PMID: 1780200 DOI: 10.3758/bf03207536] [Citation(s) in RCA: 106] [Impact Index Per Article: 3.2]
Abstract
Studies of the McGurk effect have shown that when discrepant phonetic information is delivered to the auditory and visual modalities, the information is combined into a new percept not originally presented to either modality. In typical experiments, the auditory and visual speech signals are generated by the same talker. The present experiment examined whether a discrepancy in the gender of the talker between the auditory and visual signals would influence the magnitude of the McGurk effect. A male talker's voice was dubbed onto a videotape containing a female talker's face, and vice versa. The gender-incongruent videotapes were compared with gender-congruent videotapes, in which a male talker's voice was dubbed onto a male face and a female talker's voice was dubbed onto a female face. Even though there was a clear incompatibility in talker characteristics between the auditory and visual signals on the incongruent videotapes, the resulting magnitude of the McGurk effect was not significantly different for the incongruent as opposed to the congruent videotapes. The results indicate that the mechanism for integrating speech information from the auditory and the visual modalities is not disrupted by a gender incompatibility even when it is perceptually apparent. The findings are compatible with the theoretical notion that information about voice characteristics of the talker is extracted and used to normalize the speech signal at an early stage of phonetic processing, prior to the integration of the auditory and the visual information.