1. Lester-Smith RA, Jebaily CG, Story BH. The Effects of Remote Signal Transmission and Recording on Acoustical Measures of Simulated Essential Vocal Tremor: Considerations for Remote Treatment Research and Telepractice. J Voice 2024; 38:325-336. [PMID: 34702610; PMCID: PMC9033886; DOI: 10.1016/j.jvoice.2021.09.012]
Abstract
PURPOSE Studies on medical and behavioral interventions for essential vocal tremor (EVT) have shown inconsistent effects on acoustical and perceptual outcome measures across studies and across participants. Remote acoustical and perceptual assessments might facilitate studies with larger samples of participants and repeated measures that could clarify treatment effects and identify optimal treatment candidates. Furthermore, remote acoustical and perceptual assessment might allow clinicians to monitor clients' treatment responses and optimize treatment approaches during telepractice. Thus, the purpose of this study was to evaluate the accuracy of remote signal transmission and recording for acoustical and perceptual assessment of EVT. METHOD Simulations of EVT were produced using a computational model and were recorded using local and remote procedures to represent client- and clinician-end recordings, respectively. Acoustical analyses measured the extent and rate of fundamental frequency (fo) and intensity modulation to represent vocal tremor severity and the cepstral peak prominence (CPPS) to represent voice quality. The data were analyzed using repeated measures analysis of variance (ANOVA) with recording as the within-subjects factor and sex of the computational model as the between-subjects factor. RESULTS There was a significant main effect of recording on the rate of fo modulation and significant interactions of recording and sex for the extent of intensity modulation, rate of intensity modulation, and CPPS. Post hoc pairwise comparisons and analysis of effect size indicated that recording procedures had the largest effect on the extent of intensity modulation for male simulations, the rate of intensity modulation for male and female simulations, and the CPPS for male and female simulations.
Despite having disabled all known software and computer audio enhancing options and having stable ethernet connections, there was inconsistent attenuation of signal amplitude in remote recordings that was most problematic for samples with a breathy voice quality but also affected samples with typical and pressed voice qualities. CONCLUSIONS Acoustical measures that correlate to perception of vocal tremor and voice quality were altered by remote signal transmission and recording. In particular, signal transmission and recording in Zoom altered time-based estimates of intensity modulation and CPPS with male and female simulations of EVT and magnitude-based estimates of intensity modulation with male simulations of EVT. In contrast, signal transmission and recording in Zoom minimally altered time- and magnitude-based estimates of fo modulation with male and female simulations of EVT. Therefore, acoustical and perceptual assessments of EVT should be performed using audio recordings that are collected locally on the participant- or client-end, particularly when measuring modulation of intensity and CPP or estimating vocal tremor severity and voice quality. Development of procedures for collecting local audio recordings in remote settings may expand data collection for treatment research and enhance telepractice.
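The tremor metrics at the center of this abstract (extent and rate of fo or intensity modulation) can be sketched in a few lines. This is an illustrative implementation, not the authors' analysis code: the zero-crossing rate estimate and the synthetic 5 Hz tremor contour below are assumptions for demonstration only.

```python
import math

def modulation_metrics(contour, fs):
    """Estimate modulation rate (Hz) and extent (%) of a quasi-sinusoidal
    fo or intensity contour sampled at fs Hz.

    Rate: zero crossings of the mean-removed contour (two per cycle).
    Extent: half the peak-to-peak excursion relative to the mean, in percent.
    """
    mean = sum(contour) / len(contour)
    centered = [x - mean for x in contour]
    crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
    duration = (len(contour) - 1) / fs
    rate_hz = crossings / (2.0 * duration)
    extent_pct = 100.0 * (max(contour) - min(contour)) / (2.0 * mean)
    return rate_hz, extent_pct

# Synthetic fo contour: 5 Hz tremor, roughly +/-6% extent, on a 200 Hz carrier.
fs = 100.0
contour = [200.0 * (1 + 0.06 * math.sin(2 * math.pi * 5 * n / fs + 0.3))
           for n in range(200)]
rate, extent = modulation_metrics(contour, fs)
```

A zero-crossing count is a crude rate estimator; spectral methods on the demodulated contour are common alternatives, but the definition of extent and rate is the same.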
Affiliation(s)
- Rosemary A Lester-Smith
- Department of Speech, Language, and Hearing Sciences, Moody College of Communication, The University of Texas at Austin, Austin, Texas
- Charles G Jebaily
- Department of Speech, Language, and Hearing Sciences, Moody College of Communication, The University of Texas at Austin, Austin, Texas; Texas NeuroRehab Center, Austin, Texas
- Brad H Story
- Department of Speech, Language, and Hearing Sciences, The University of Arizona, Tucson, Arizona
2. Story BH, Bunton K. The relation of velopharyngeal coupling area and vocal tract scaling to identification of stop-nasal cognates. The Journal of the Acoustical Society of America 2023; 154:3741-3759. [PMID: 38099832; DOI: 10.1121/10.0023958]
Abstract
The purpose of this study was to determine whether the threshold of velopharyngeal (VP) coupling area at which listeners switch from identifying a consonant as a stop to a nasal in North American English was different for speech produced by a model based on an adult male, an adult female, and a 4-year-old child. V1CV2 stimuli were generated with a speech production model that encodes phonetic segments as relative acoustic targets imposed on an underlying vocal tract and laryngeal structure that can be scaled according to sex and age. Each V1CV2 was synthesized with a set of VP coupling functions whose maximum area ranged from 0 to 0.1 cm2. Results showed that scaling the vocal tract and vocal folds had essentially no effect on the VP coupling area at which listener identification shifted from stop to nasal. The range of coupling areas at which the crossover occurred was 0.037-0.049 cm2 for the male model, 0.040-0.055 cm2 for the female model, and 0.039-0.052 cm2 for the 4-year-old child model, and the overall mean was 0.044 cm2. Calculations of band-limited peak nasalance indicated that 85% peak nasalance during the consonant was well aligned with listener responses.
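Nasalance, used above to characterize the oral-nasal energy balance, is conventionally computed as the nasal fraction of combined nasal and oral acoustic energy, expressed in percent. A minimal frame-based sketch follows; the frame length, hop size, and toy signals are illustrative assumptions, and the band-limiting mentioned in the abstract is assumed to have been applied upstream.

```python
def nasalance_contour(oral, nasal, frame_len=256, hop=128):
    """Frame-wise nasalance: 100 * nasal energy / (nasal + oral energy)."""
    contour = []
    for start in range(0, len(oral) - frame_len + 1, hop):
        e_oral = sum(x * x for x in oral[start:start + frame_len])
        e_nasal = sum(x * x for x in nasal[start:start + frame_len])
        total = e_oral + e_nasal
        contour.append(100.0 * e_nasal / total if total > 0 else 0.0)
    return contour

def peak_nasalance(oral, nasal, **kw):
    """Peak nasalance over all frames, the summary value used above."""
    return max(nasalance_contour(oral, nasal, **kw))

# Toy example: the nasal channel carries half the oral amplitude, so every
# frame has an energy ratio of 0.25 / (1 + 0.25) = 20%.
oral = [1.0] * 1024
nasal = [0.5] * 1024
peak = peak_nasalance(oral, nasal)
```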
Affiliation(s)
- Brad H Story
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85721-0071, USA
- Kate Bunton
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85721-0071, USA
3. Barsties V Latoszek B, Englert M, Lucero JC, Behlau M. The Performance of the Acoustic Voice Quality Index and Acoustic Breathiness Index in Synthesized Voices. J Voice 2023; 37:804.e21-804.e28. [PMID: 34218968; DOI: 10.1016/j.jvoice.2021.05.005]
Abstract
OBJECTIVE The aim of the present study was to investigate the performance of the Acoustic Voice Quality Index (AVQI) and the Acoustic Breathiness Index (ABI) in synthesized voice samples. METHOD The validity of the AVQI and ABI was analyzed in synthesized voice samples with controlled degrees of predefined deviation in overall voice quality (G-scale) and breathiness (B-scale). A set of 26 synthesized voice samples with various degrees of severity on the G-scale, with and without prominent breathiness, was created for male and female voices. RESULTS ABI showed higher validity than AVQI in the evaluation of breathiness. ABI accurately evaluated degrees of breathiness without being influenced by roughness in the voice samples, confirming the findings of other studies with natural voices. ABI was also more robust than AVQI in the evaluation of severely disordered voice samples. Finally, AVQI represented overall voice quality with an emphasis on breathiness and less weight on roughness, although roughness is a necessary component of overall voice quality evaluation. CONCLUSION AVQI and ABI are two robust measurements for the evaluation of voice quality. However, ABI produced fewer errors than AVQI in the analysis of more severely abnormal voice signals. Disturbances of other subtypes of abnormal overall voice quality, such as roughness, were not reflected in the results of ABI.
Affiliation(s)
- Ben Barsties V Latoszek
- Speech-Language Pathology, SRH University of Applied Health Sciences, Düsseldorf, Germany; Department of Phoniatrics and Pediatric Audiology, University Hospital Münster, University of Münster, Münster, Germany
- Marina Englert
- Human Communication Disorders, Universidade Federal de São Paulo - UNIFESP, São Paulo, SP, Brazil; Centro de Estudos da Voz - CEV, São Paulo, SP, Brazil
- Jorge C Lucero
- Department of Computer Science, Universidade de Brasília - UnB, Brasília, Federal District, Brazil
- Mara Behlau
- Human Communication Disorders, Universidade Federal de São Paulo - UNIFESP, São Paulo, SP, Brazil; Centro de Estudos da Voz - CEV, São Paulo, SP, Brazil
4. Whitling S, Botzum HM, van Mersbergen MR. Degree of Breathiness in a Synthesized Voice Signal as it Differentiates Masculine versus Feminine Voices. J Voice 2023:S0892-1997(23)00150-9. [PMID: 37280147; DOI: 10.1016/j.jvoice.2023.04.022]
Abstract
INTRODUCTION Most studies determining speakers' perceived gender as binarily female or male rely on F0 perception, although other vocal parameters may also contribute to the perception of gender. The current study focused on the impact of breathiness on the perception of speakers' gender as a biological variable (feminine or masculine). METHODS Thirty-one normal-hearing, native English speakers (18 female, 13 male; mean age 23 years, SD = 3.54) were auditorily and visually trained and then took part in a categorical perception task. A continuum of nine samples of the word "hello" was created in an airway modulation model of speech and voice production. Resting vocal fold length, resting vocal fold thickness, F0, and vocal tract length were fixed. Glottal width at the vocal process, posterior glottal gap, and bronchial pressure were continuously modified across stimuli. Each stimulus was randomly presented 30 times within each of five blocks (150 presentations in total). Participants rated stimuli as binarily female or male. RESULTS There was a sigmoidal shift along the continuum between voicing perceived as feminine and as masculine. This shift was evident at stimuli four and five, indicating a nonlinear, discrete perception of breathiness among participants. Response times were also significantly slower for these two stimuli, consistent with categorical perception of breathiness. CONCLUSION Breathiness created by a change in glottal width of at least 0.21 cm may influence the perception of a speaker's gender.
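The stimulus-four-to-five boundary reported here is the kind of crossover that a logistic (sigmoid) fit to identification proportions locates. A hedged sketch with invented response proportions follows; the grid-search fit and all numbers below are illustrative, not the study's data or statistics.

```python
import math

def fit_sigmoid(props):
    """Fit p(step) = 1 / (1 + exp(-k * (step - b))) to identification
    proportions by brute-force grid search; returns boundary b and slope k."""
    steps = range(1, len(props) + 1)
    best_b, best_k, best_err = None, None, float("inf")
    for b10 in range(10, 10 * len(props) + 1):   # boundary b from 1.0 to 9.0
        for k10 in range(2, 81):                 # slope k from 0.2 to 8.0
            b, k = b10 / 10.0, k10 / 10.0
            err = sum((p - 1.0 / (1.0 + math.exp(-k * (s - b)))) ** 2
                      for s, p in zip(steps, props))
            if err < best_err:
                best_b, best_k, best_err = b, k, err
    return best_b, best_k

# Hypothetical proportions of one response category over a 9-step continuum,
# shifting sharply between steps 4 and 5, as in a categorical pattern.
props = [0.02, 0.03, 0.08, 0.25, 0.80, 0.93, 0.97, 0.98, 0.99]
boundary, slope = fit_sigmoid(props)
```

In practice a maximum-likelihood psychometric-function fit would be used instead of least squares, but the boundary and slope parameters play the same roles.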
Affiliation(s)
- Susanna Whitling
- Department of Logopedics, Phoniatrics and Audiology, Lund University, Lund, Sweden
5. Herbst CT, Story BH, Meyer D. Acoustical Theory of Vowel Modification Strategies in Belting. J Voice 2023:S0892-1997(23)00004-8. [PMID: 37080890; DOI: 10.1016/j.jvoice.2023.01.004]
Abstract
Various authors have argued that belting is to be produced by "speech-like" sounds, with the first and second supraglottic vocal tract resonances (fR1 and fR2) at frequencies of the vowels determined by the lyrics to be sung. Acoustically, the hallmark of belting has been identified as a dominant second harmonic, possibly enhanced by first resonance tuning (fR1≈2fo). It is not clear how both these concepts - (a) phonating with "speech-like," unmodified vowels; and (b) producing a belting sound with a dominant second harmonic, typically enhanced by fR1 - can be upheld when singing across a singer's entire musical pitch range. For instance, anecdotal reports from pedagogues suggest that vowels with a low fR1, such as [i] or [u], might have to be modified considerably (by raising fR1) in order to phonate at higher pitches. These issues were systematically addressed in silico with respect to treble singing, using a linear source-filter voice production model. The dominant harmonic of the radiated spectrum was assessed in 12,987 simulations, covering a parameter space of 37 fundamental frequencies (fo) across the musical pitch range from C3 to C6; 27 voice source spectral slope settings from -4 to -30 dB/octave; computed for 13 different IPA vowels. The results suggest that, for most unmodified vowels, the stereotypical belting sound characteristics with a dominant second harmonic can only be produced over a pitch range of about a musical fifth, centered at fo≈0.5fR1. In the [ɔ] and [ɑ] vowels, that range is extended to an octave, supported by a low second resonance. Data aggregation - considering the relative prevalence of vowels in American English - suggests that, historically, belting with fR1≈2fo was derived from speech, and that songs with an extended musical pitch range likely demand considerable vowel modification. We thus argue that - on acoustical grounds - the pedagogical commandment for belting with unmodified, "speech-like" vowels cannot always be fulfilled.
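The claim that a dominant second harmonic requires fo ≈ 0.5 fR1 can be illustrated with a toy linear source-filter calculation. This is a simplified stand-in for the paper's model: the Lorentzian resonance shape, the 80 Hz bandwidth, and the resonance frequencies below are assumptions chosen only to show the mechanism.

```python
import math

def dominant_harmonic(fo, resonances, slope_db_per_oct, bandwidth=80.0, n_harm=20):
    """Index (1-based) of the strongest radiated harmonic under a toy
    linear source-filter model: a source rolling off at slope_db_per_oct,
    filtered by simple second-order resonance peaks."""
    def filter_db(f):
        gain = 0.0
        for fr in resonances:
            # Magnitude of a resonance peak (Lorentzian approximation).
            gain += 10.0 * math.log10(
                1.0 / (((f**2 - fr**2) / (fr * bandwidth)) ** 2 + 1.0)
            )
        return gain
    levels = []
    for n in range(1, n_harm + 1):
        source_db = slope_db_per_oct * math.log2(n)   # harmonic rolloff
        levels.append(source_db + filter_db(n * fo))
    return max(range(n_harm), key=lambda i: levels[i]) + 1

# fo at half of fR1 (the fR1 ~ 2*fo belting condition): H2 sits on the
# first resonance peak and outweighs the stronger-at-source H1.
h = dominant_harmonic(fo=350.0, resonances=[700.0, 1200.0],
                      slope_db_per_oct=-12.0)
```

Shifting fo away from 0.5 fR1 in this sketch quickly hands dominance back to the first harmonic, which is the acoustical core of the vowel-modification argument.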
Affiliation(s)
- Christian T Herbst
- Janette Ogg Voice Research Center, Shenandoah Conservatory, Winchester, Virginia; Department of Vocal Studies, Mozarteum University, Salzburg, Austria
- Brad H Story
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona
- David Meyer
- Janette Ogg Voice Research Center, Shenandoah Conservatory, Winchester, Virginia
6. Herbst CT, Story BH. Computer simulation of vocal tract resonance tuning strategies with respect to fundamental frequency and voice source spectral slope in singing. The Journal of the Acoustical Society of America 2022; 152:3548. [PMID: 36586864; DOI: 10.1121/10.0014421]
Abstract
A well-known concept of singing voice pedagogy is "formant tuning," where the lowest two vocal tract resonances (fR1, fR2) are systematically tuned to harmonics of the laryngeal voice source to maximize the level of radiated sound. A comprehensive evaluation of this resonance tuning concept is still needed. Here, the effect of fR1, fR2 variation was systematically evaluated in silico across the entire fundamental frequency range of classical singing for three voice source characteristics with spectral slopes of -6, -12, and -18 dB/octave. Respective vocal tract transfer functions were generated with a previously introduced low-dimensional computational model, and resultant radiated sound levels were expressed in dB(A). Two distinct strategies for optimized sound output emerged for low vs high voices. At low pitches, spectral slope was the predominant factor for sound level increase, and resonance tuning only had a marginal effect. In contrast, resonance tuning strategies became more prevalent and voice source strength played an increasingly marginal role as fundamental frequency increased to the upper limits of the soprano range. This suggests that different voice classes (e.g., low male vs high female) likely have fundamentally different strategies for optimizing sound output, which has fundamental implications for pedagogical practice.
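The radiated levels above are reported in dB(A). For reference, here is a sketch of the standard IEC 61672 A-weighting curve and the overall A-weighted level of a set of harmonic components; the two-component example is purely illustrative, not from the study.

```python
import math

def a_weight_db(f):
    """IEC 61672 A-weighting, in dB, at frequency f (Hz)."""
    f2 = f * f
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * math.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20.0 * math.log10(ra) + 2.0   # normalized to ~0 dB at 1 kHz

def level_dba(components):
    """Overall A-weighted level of (frequency_hz, level_db) components,
    summed as incoherent power."""
    total = sum(10.0 ** ((lvl + a_weight_db(f)) / 10.0) for f, lvl in components)
    return 10.0 * math.log10(total)

# Two harmonics of equal unweighted level: the 200 Hz component is strongly
# attenuated by the A-curve, so the 1000 Hz component dominates the total.
overall = level_dba([(200.0, 60.0), (1000.0, 60.0)])
```

Because the A-curve penalizes low frequencies, strategies that shift energy toward higher harmonics are rewarded in dB(A), which is relevant when comparing low and high voices as in this study.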
Affiliation(s)
- Brad H Story
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85718, USA
7. Ikuma T, Story B, McWhorter AJ, Adkins L, Kunduk M. Harmonics-to-noise ratio estimation with deterministically time-varying harmonic model for pathological voice signals. The Journal of the Acoustical Society of America 2022; 152:1783. [PMID: 36182331; DOI: 10.1121/10.0014177]
Abstract
The harmonics-to-noise ratio (HNR) and other spectral noise parameters are important in clinical objective voice assessment, as they can indicate the presence of nonharmonic phenomena, which are tied to the perception of hoarseness or breathiness. Existing HNR estimators are built on the assumption that voice signals are nearly periodic (fixed over a short interval), although voice pathology can induce involuntary slow modulation that violates this assumption. This paper proposes the use of a deterministically time-varying harmonic model to improve HNR measurement. To estimate the time-varying model, a two-stage iterative least squares algorithm is proposed to reduce model overfitting. The efficacy of the proposed HNR estimator is demonstrated with synthetic signals, simulated tremor signals, and recorded acoustic signals. Results indicate that the proposed algorithm produces consistent HNR measures as the extent and rate of tremor are varied.
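The time-varying harmonic model generalizes the classic stationary harmonic fit. A sketch of the stationary special case (fixed fo, least-squares projection onto a harmonic sinusoid basis) shows the underlying HNR definition; the paper's two-stage time-varying algorithm is not reproduced here, and the frame settings and synthetic signal below are assumptions.

```python
import math
import random

def hnr_db(frame, fs, fo, n_harm=5):
    """HNR via projection onto a fixed harmonic sinusoid basis.

    Assumes the frame spans an integer number of fo periods, so the basis
    is orthogonal and the projection equals a least-squares harmonic fit.
    """
    n = len(frame)
    harmonic = [0.0] * n
    for k in range(1, n_harm + 1):
        w = 2.0 * math.pi * k * fo / fs
        c = sum(x * math.cos(w * i) for i, x in enumerate(frame)) * 2.0 / n
        s = sum(x * math.sin(w * i) for i, x in enumerate(frame)) * 2.0 / n
        for i in range(n):
            harmonic[i] += c * math.cos(w * i) + s * math.sin(w * i)
    e_harm = sum(h * h for h in harmonic)
    e_noise = sum((x - h) ** 2 for x, h in zip(frame, harmonic))
    return 10.0 * math.log10(e_harm / e_noise)

# Synthetic frame: 100 Hz sinusoid plus Gaussian noise at a true HNR of
# 10*log10(0.5 / 0.01) = 17 dB.
random.seed(1)
fs = 8000.0
frame = [math.sin(2.0 * math.pi * 100.0 * i / fs) + random.gauss(0.0, 0.1)
         for i in range(800)]
estimate = hnr_db(frame, fs, 100.0)
```

The stationary fit breaks down exactly where the paper aims: when tremor modulates fo within the frame, the residual absorbs harmonic energy and HNR is underestimated, motivating the deterministically time-varying model.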
Affiliation(s)
- Takeshi Ikuma
- Department of Otolaryngology-Head and Neck Surgery, Louisiana State University Health Sciences Center, New Orleans, Louisiana 70112, USA
- Brad Story
- Department of Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85721, USA
- Andrew J McWhorter
- Department of Otolaryngology-Head and Neck Surgery, Louisiana State University Health Sciences Center, New Orleans, Louisiana 70112, USA
- Lacey Adkins
- Department of Otolaryngology-Head and Neck Surgery, Louisiana State University Health Sciences Center, New Orleans, Louisiana 70112, USA
- Melda Kunduk
- Department of Communication Disorders, Louisiana State University, Baton Rouge, Louisiana 70803, USA
8. Aichinger P, Kumar SP, Lehoux S, Švec JG. Simulated Laryngeal High-Speed Videos for the Study of Normal and Dysphonic Vocal Fold Vibration. Journal of Speech, Language, and Hearing Research 2022; 65:2431-2445. [PMID: 35772399; DOI: 10.1044/2022_jslhr-21-00673]
Abstract
PURPOSE Laryngeal high-speed videoendoscopy (LHSV) has been recognized as a highly valuable modality for the scientific investigations of vocal fold (VF) vibrations. In contrast to stroboscopic imaging, LHSV enables visualizing aperiodic VF vibrations. However, the technique is less well established in the clinical care of disordered voices, partly because the properties of aperiodic vibration patterns are not yet described comprehensively. To address this, a computer model for simulation of VF vibration patterns observed in a variety of different phonation types is proposed. METHOD A previously published kinematic model of mucosal wave phenomena is generalized to be capable of left-right asymmetry and to simulate endoscopic videos instead of only kymograms of VF vibrations at single sagittal positions. The most influential control parameters are the glottal halfwidths, the oscillation frequencies, the amplitudes, and the phase delays. RESULTS The presented videos demonstrate zipper-like vibration, pressed voice, voice onset, constant and time-varying left-right and anterior-posterior phase differences, as well as left-right frequency differences of the VF vibration. Video frames, videokymograms, phonovibrograms, glottal area waveforms, and waveforms of VF contact area relating to electroglottograms are shown, as well as selected kinematic parameters. CONCLUSION The presented videos demonstrate the ability to produce vibration patterns that are similar to those typically seen in endoscopic videos obtained from vocally healthy and dysphonic speakers. SUPPLEMENTAL MATERIAL https://doi.org/10.23641/asha.20151833.
Affiliation(s)
- Philipp Aichinger
- Division of Phoniatrics-Logopedics, Department of Otorhinolaryngology, Medical University of Vienna, Austria
- S Pravin Kumar
- Department of Biomedical Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, India
- Sarah Lehoux
- Voice Research Laboratory, Department of Experimental Physics, Faculty of Science, Palacký University, Olomouc, Czech Republic
- Jan G Švec
- Voice Research Laboratory, Department of Experimental Physics, Faculty of Science, Palacký University, Olomouc, Czech Republic; Voice and Hearing Centre Prague, Medical Healthcom, Ltd., Czech Republic
9. Contribution of Vocal Tract and Glottal Source Spectral Cues in the Generation of Acted Happy and Aggressive Spanish Vowels. Applied Sciences 2022. [DOI: 10.3390/app12042055]
Abstract
The source-filter model is one of the main techniques applied to speech analysis and synthesis. Recent advances in voice production by means of three-dimensional (3D) source-filter models have overcome several limitations of classic one-dimensional techniques. Despite the development of preliminary attempts to improve the expressiveness of 3D-generated voices, they are still far from achieving realistic results. Towards this goal, this work analyses the contribution of both the vocal tract (VT) and the glottal source spectral (GSS) cues in the generation of happy and aggressive speech through a GlottDNN-based analysis-by-synthesis methodology. Paired neutral and expressive utterances are parameterised to generate different combinations of expressive vowels, applying the target expressive GSS and/or VT cues on the neutral vowels after transplanting the expressive prosody on these utterances. The conducted objective tests focused on Spanish [a], [i] and [u] vowels show that both GSS and VT cues significantly reduce the spectral distance to the expressive target. The results from the perceptual test show that VT cues make a statistically significant contribution in the expression of happy and aggressive emotions for [a] vowels, while the GSS contribution is significant in [i] and [u] vowels.
10. Story BH, Bunton K. The relation of velopharyngeal coupling area to the identification of stop versus nasal consonants in North American English based on speech generated by acoustically driven vocal tract modulations. The Journal of the Acoustical Society of America 2021; 150:3618. [PMID: 34852618; DOI: 10.1121/10.0007223]
Abstract
The purpose of this study was to determine the threshold of velopharyngeal coupling area at which listeners switch from identifying a consonant as a stop to a nasal in North American English, based on V1CV2 stimuli generated with a speech production model that encodes phonetic segments as relative acoustic targets. Each V1CV2 was synthesized with a set of velopharyngeal coupling functions whose area ranged from 0 to 0.1 cm2. Results show that consonants were identified by listeners as a stop when the coupling area was less than 0.035-0.057 cm2, depending on place of articulation and final vowel. The smallest coupling area (0.035 cm2) at which the stop-to-nasal switch occurred was found for an alveolar consonant in the /ɑCi/ context, whereas the largest (0.057 cm2) was for a bilabial in /ɑCɑ/. For each stimulus, the balance of oral versus nasal acoustic energy was characterized by the peak nasalance during the consonant. Stimuli with peak nasalance below 40% were mostly identified by listeners as stops, whereas those above 40% were identified as nasals. This study was intended to be a precursor to further investigations using the same model but scaled to represent the developing speech production system of male and female talkers.
Affiliation(s)
- Brad H Story
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85721-0071, USA
- Kate Bunton
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85721-0071, USA
11. Geng B, Movahhedi M, Xue Q, Zheng X. Vocal fold vibration mode changes due to cricothyroid and thyroarytenoid muscle interaction in a three-dimensional model of the canine larynx. The Journal of the Acoustical Society of America 2021; 150:1176. [PMID: 34470336; DOI: 10.1121/10.0005883]
Abstract
Using a continuum model based on magnetic resonance imaging of a canine larynx, parametric simulations of vocal fold vibration during phonation were conducted with the cricothyroid muscle (CT) and the thyroarytenoid muscle (TA) independently activated from zero to full activation. The fundamental frequency (f0) first increased and then experienced a downward jump as TA activity gradually increased under moderate to high CT activation. Proper orthogonal decomposition analysis revealed that the vocal fold vibrations were dominated by two modes representing a lateral motion and a rotational motion, respectively, and the f0 drop was associated with a switch in the order of the two modes. In another parametric set where only the vocalis was active, f0 increased monotonically with both TA and CT activity and the mode switch did not occur. The results suggested that the active stress in the TA, which causes large stress differences between the body and cover, is essential for the occurrence of the rotational mode and mode switch. Relatively greater TA activity tends to promote the rotational mode, while relatively greater CT activity tends to promote the lateral mode. The results also suggested that the vibration modes affected f0 by affecting the contribution of the TA stress to the effective stiffness. The switch in the dominant mode caused the non-monotonic change in f0.
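The proper orthogonal decomposition used here to separate lateral and rotational vibration modes is, in the snapshot formulation, an SVD of the displacement data. A generic sketch follows; this is not the paper's implementation, and the synthetic lateral/rotational fields are invented for illustration.

```python
import numpy as np

def pod_modes(snapshots, n_modes=2):
    """Proper orthogonal decomposition of a snapshot matrix via SVD.

    snapshots has shape (n_points, n_times): each column is the vocal fold
    displacement field at one instant. Returns the leading spatial modes
    and the fraction of total energy each captures.
    """
    x = snapshots - snapshots.mean(axis=1, keepdims=True)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    energy = s**2 / np.sum(s**2)
    return u[:, :n_modes], energy[:n_modes]

# Synthetic field: a dominant "lateral" mode plus a weaker "rotational"
# mode, oscillating at 100 Hz, with a little measurement noise.
rng = np.random.default_rng(0)
pts = np.linspace(0.0, 1.0, 50)
t = np.linspace(0.0, 0.1, 200)
lateral = np.sin(np.pi * pts)[:, None] * np.cos(2 * np.pi * 100 * t)[None, :]
rotational = np.sin(2 * np.pi * pts)[:, None] * np.sin(2 * np.pi * 100 * t)[None, :]
data = 1.0 * lateral + 0.4 * rotational + 0.01 * rng.standard_normal((50, 200))
modes, energy = pod_modes(data, n_modes=2)
```

Tracking which recovered mode carries more energy as a control parameter varies is the kind of diagnostic used to detect the mode switch described above.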
Affiliation(s)
- Biao Geng
- Department of Mechanical Engineering, University of Maine, Orono, Maine 04473, USA
- Qian Xue
- Department of Mechanical Engineering, University of Maine, Orono, Maine 04473, USA
- Xudong Zheng
- Department of Mechanical Engineering, University of Maine, Orono, Maine 04473, USA
12. Taitz A, Assaneo MF, Shalom DE, Trevisan MA. Motor representations underlie the reading of unfamiliar letter combinations. Sci Rep 2020; 10:3828. [PMID: 32123186; PMCID: PMC7052247; DOI: 10.1038/s41598-020-59199-6]
Abstract
Silent reading is a cognitive operation that produces verbal content with no vocal output. One relevant question is the extent to which this verbal content is processed as overt speech in the brain. To address this, we acquired sound, eye trajectories and lip dynamics during the reading of consonant-consonant-vowel (CCV) combinations which are infrequent in the language. We found that the duration of the first fixations on the CCVs during silent reading correlates with the duration of the transitions between consonants when the CCVs are actually uttered. With the aid of an articulatory model of the vocal system, we show that transitions measure the articulatory effort required to produce the CCVs. This means that first fixations during silent reading are lengthened when the CCVs require a greater laryngeal and/or articulatory effort to be pronounced. Our results support the idea that a speech motor code is used for the recognition of infrequent text strings during silent reading.
Affiliation(s)
- Alan Taitz
- Physics Institute of Buenos Aires (IFIBA) CONICET, Buenos Aires, Argentina
- M Florencia Assaneo
- Department of Psychology, New York University, New York, NY, 10003, USA; Instituto de Neurobiología, UNAM, Campus Juriquilla, Querétaro, México
- Diego E Shalom
- Physics Institute of Buenos Aires (IFIBA) CONICET, Buenos Aires, Argentina; Department of Physics, University of Buenos Aires (UBA), Buenos Aires, 1428EGA, Argentina
- Marcos A Trevisan
- Physics Institute of Buenos Aires (IFIBA) CONICET, Buenos Aires, Argentina; Department of Physics, University of Buenos Aires (UBA), Buenos Aires, 1428EGA, Argentina
13. Lee J, Rodriguez E, Mefferd A. Direction-Specific Jaw Dysfunction and Its Impact on Tongue Movement in Individuals With Dysarthria Secondary to Amyotrophic Lateral Sclerosis. Journal of Speech, Language, and Hearing Research 2020; 63:499-508. [PMID: 32074462; DOI: 10.1044/2019_jslhr-19-00174]
Abstract
Purpose The current study tested jaw movement characteristics and their impact on tongue movement for speech production in individuals with amyotrophic lateral sclerosis (ALS). Specifically, the study examined tongue and jaw movement in multiple directions during jaw opening and closing strokes in individuals with ALS and controls. Method Twenty-two individuals with ALS and 22 controls participated in the current study. Tongue and jaw movements during the production of the words "Iowa" and "Ohio" (produced in a carrier phrase) were recorded using electromagnetic articulography. Tongue and jaw distances were measured for jaw opening and closing strokes. Distance was measured in the anterior-posterior and superior-inferior dimensions (retraction, advancement, lowering, and raising). Results Findings revealed that individuals with ALS exaggerated their jaw opening movements, but not their jaw closing movements, compared to controls. Between the groups, a comparable tongue lowering distance was observed during jaw opening movements. In contrast, reduced tongue raising was observed during the jaw closing movements in individuals with ALS compared to controls. Conclusion The findings suggest that individuals with ALS produce excessive jaw opening movements in the absence of excessive jaw closing movements. The lack of excessive jaw closing movements results in reduced tongue raising in these individuals. Excessive jaw opening movements alone suggest a direction-specific jaw dysfunction. Future studies should examine whether excessive jaw raising can be facilitated and if it enhances tongue raising movement for speech production in individuals with dysarthria secondary to ALS.
Affiliation(s)
- Jimin Lee
- Department of Communication Sciences and Disorders, The Pennsylvania State University, University Park
- Elizabeth Rodriguez
- Department of Communication Sciences and Disorders, The Pennsylvania State University, University Park
- Antje Mefferd
- Department of Hearing and Speech Sciences, Vanderbilt University Medical Center, Nashville, TN
14. Bergevin C, Narayan C, Williams J, Mhatre N, Steeves JK, Bernstein JG, Story B. Overtone focusing in biphonic Tuvan throat singing. eLife 2020; 9:50476. [PMID: 32048990; PMCID: PMC7064340; DOI: 10.7554/elife.50476]
Abstract
Khoomei is a unique singing style originating from the Republic of Tuva in Central Asia. Singers produce two pitches simultaneously: a booming low-frequency rumble alongside a hovering high-pitched whistle-like tone. The biomechanics of this biphonation are not well understood. Here, we use sound analysis, dynamic magnetic resonance imaging, and vocal tract modeling to demonstrate how biphonation is achieved by modulating vocal tract morphology. Tuvan singers show remarkable control in shaping their vocal tract to narrowly focus the harmonics (or overtones) emanating from their vocal cords. The biphonic sound is a combination of the fundamental pitch and a focused filter state, which is at the higher pitch (1-2 kHz) and formed by merging two formants, thereby greatly enhancing sound production in a very narrow frequency range. Most importantly, we demonstrate that this biphonation is a phenomenon arising from linear filtering rather than from a nonlinear source.
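The formant-merging mechanism described in this abstract can be illustrated with a small source-filter sketch. This is not the authors' model; the formant frequencies, bandwidths, and the simple two-pole resonators are illustrative assumptions chosen only to show how merging two formants concentrates energy in a narrow band:

```python
import numpy as np
from scipy.signal import lfilter

fs = 44100
f0 = 140.0                                  # low drone pitch, Hz
t = np.arange(int(0.5 * fs)) / fs
# Harmonic-rich source: equal-amplitude harmonics of f0 up to ~5 kHz
source = sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, int(5000 / f0)))

def resonator(x, fc, bw):
    """Two-pole resonant filter standing in for one formant."""
    r = np.exp(-np.pi * bw / fs)
    th = 2 * np.pi * fc / fs
    return lfilter([1.0 - r], [1.0, -2.0 * r * np.cos(th), r * r], x)

# Ordinary vowel-like state: two well-separated formants
separated = resonator(resonator(source, 800, 80), 1800, 120)
# "Focused" filter state: the two formants merged near 1.5 kHz
focused = resonator(resonator(source, 1450, 50), 1550, 50)

def band_fraction(x, lo, hi):
    """Fraction of spectral energy falling inside [lo, hi] Hz."""
    X = np.abs(np.fft.rfft(x)) ** 2
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    return X[(f >= lo) & (f <= hi)].sum() / X.sum()

# Merging the formants concentrates energy in a narrow band around 1.5 kHz,
# boosting whichever harmonics fall there (the whistle-like upper pitch).
focus_gain = band_fraction(focused, 1300, 1700) / band_fraction(separated, 1300, 1700)
```

Because both resonators reinforce the same narrow region, the "focused" signal devotes a far larger fraction of its energy to the 1.3-1.7 kHz band than the separated-formant signal does, which is the linear-filtering effect the paper reports.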
Affiliation(s)
- Christopher Bergevin
- Physics and Astronomy, York University, Toronto, Canada; Centre for Vision Research, York University, Toronto, Canada; Fields Institute for Research in Mathematical Sciences, Toronto, Canada; Kavli Institute of Theoretical Physics, University of California, Santa Barbara, United States
- Chandan Narayan
- Languages, Literatures and Linguistics, York University, Toronto, Canada
- Joy Williams
- York MRI Facility, York University, Toronto, Canada
- Jennifer K. E. Steeves
- Centre for Vision Research, York University, Toronto, Canada; Psychology, York University, Toronto, Canada
- Joshua G. W. Bernstein
- National Military Audiology & Speech Pathology Center, Walter Reed National Military Medical Center, Bethesda, United States
- Brad Story
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, United States
15
Pont A, Guasch O, Arnela M. Finite element generation of sibilants /s/ and /z/ using random distributions of Kirchhoff vortices. International Journal for Numerical Methods in Biomedical Engineering 2020; 36:e3302. [PMID: 31883313 DOI: 10.1002/cnm.3302]
Abstract
The numerical simulation of sibilant sounds in three-dimensional realistic vocal tracts constitutes a challenging problem because it involves a wide range of turbulent flow scales. Rotating eddies generate acoustic waves whose wavelengths are inversely proportional to the local Mach number of the flow. If that number is low, very fine meshes are required to capture the flow dynamics. In standard hybrid computational aeroacoustics (CAA), where the incompressible Navier-Stokes equations are first solved to obtain a source term that is then input into an acoustic wave equation, this implies resorting to supercomputer facilities. As a consequence, only very short time intervals of the sibilant can be produced, which may be enough for its spectral characterization but insufficient to synthesize, for instance, an audio file or a syllable sound from it. In this work, we propose to replace the aeroacoustic source term obtained from computational fluid dynamics (CFD) in the first step of hybrid CAA with a random distribution of Kirchhoff's spinning vortices located in the region between the upper incisors and the lower lip. In this way, one only needs to solve a linear wave equation to generate a sibilant, and therefore avoids the costly large-scale computations. We show that our proposal can recover the outcomes of hybrid CAA simulations on average, and that it can be applied to generate the sibilants /s/ and /z/. Modeling and implementation details of the Kirchhoff vortex distribution in a stabilized finite element code are discussed in the paper, as well as the outcomes of the simulations.
Affiliation(s)
- Arnau Pont
- GTM Grup de Recerca en Tecnologies Mèdia, La Salle-Universitat Ramon Llull, Barcelona, Spain
- Oriol Guasch
- GTM Grup de Recerca en Tecnologies Mèdia, La Salle-Universitat Ramon Llull, Barcelona, Spain
- Marc Arnela
- GTM Grup de Recerca en Tecnologies Mèdia, La Salle-Universitat Ramon Llull, Barcelona, Spain
16
Alexander R, Sorensen T, Toutios A, Narayanan S. A modular architecture for articulatory synthesis from gestural specification. The Journal of the Acoustical Society of America 2019; 146:4458. [PMID: 31893678 PMCID: PMC7043897 DOI: 10.1121/1.5139413]
Abstract
This paper proposes a modular architecture for articulatory synthesis from a gestural specification comprising relatively simple models for the vocal tract, the glottis, aero-acoustics, and articulatory control. The vocal tract module combines a midsagittal statistical analysis articulatory model, derived by factor analysis of air-tissue boundaries in real-time magnetic resonance imaging data, with an αβ model for converting midsagittal section to area function specifications. The aero-acoustics and glottis models were based on a software implementation of classic work by Maeda. The articulatory control module uses dynamical systems, which implement articulatory gestures, to animate the statistical articulatory model, inspired by the task dynamics model. Results on synthesizing vowel-consonant-vowel sequences with plosive consonants, using models that were built on data from, and simulate the behavior of, two different speakers are presented.
Affiliation(s)
- Rachel Alexander
- Signal Analysis & Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California 90007, USA
- Tanner Sorensen
- Signal Analysis & Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California 90007, USA
- Asterios Toutios
- Signal Analysis & Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California 90007, USA
- Shrikanth Narayanan
- Signal Analysis & Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California 90007, USA
17
Glottal Source Contribution to Higher Order Modes in the Finite Element Synthesis of Vowels. Applied Sciences (Basel) 2019. [DOI: 10.3390/app9214535]
Abstract
Articulatory speech synthesis has long been based on one-dimensional (1D) approaches. They assume plane wave propagation within the vocal tract and disregard higher order modes that typically appear above 5 kHz. However, such modes may be relevant in obtaining a more natural voice, especially for phonation types with significant high frequency energy (HFE) content. This work studies the contribution of the glottal source at high frequencies in the 3D numerical synthesis of vowels. The spoken vocal range is explored using an LF (Liljencrants-Fant) model enhanced with aspiration noise and controlled by the Rd glottal shape parameter. The vowels [ɑ], [i], and [u] are generated with a finite element method (FEM) using realistic 3D vocal tract geometries obtained from magnetic resonance imaging (MRI), as well as simplified straight vocal tracts of circular cross-sectional area. The symmetry of the latter prevents the onset of higher order modes. Thus, the comparison between realistic and simplified geometries enables us to analyse the influence of such modes. The simulations indicate that higher order modes may be perceptually relevant, particularly for tense phonations (lower Rd values) and/or high fundamental frequency (F0) values. Conversely, vowels with a lax phonation and/or low F0s may result in inaudible HFE levels, especially if aspiration noise is not considered in the glottal source model.
18
Story BH, Bunton K. A model of speech production based on the acoustic relativity of the vocal tract. The Journal of the Acoustical Society of America 2019; 146:2522. [PMID: 31671993 PMCID: PMC7064311 DOI: 10.1121/1.5127756]
Abstract
A model is described in which the effects of articulatory movements to produce speech are generated by specifying relative acoustic events along a time axis. These events consist of directional changes of the vocal tract resonance frequencies that, when associated with a temporal event function, are transformed, via acoustic sensitivity functions, into time-varying modulations of the vocal tract shape. Because the time course of the events may be considerably overlapped in time, coarticulatory effects are automatically generated. Production of sentence-level speech with the model is demonstrated with audio samples and vocal tract animations.
Affiliation(s)
- Brad H Story
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85721, USA
- Kate Bunton
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85721, USA
19
Qureshi TM, Syed KS. Improved vocal tract model for the elongation of segment lengths in a real time. Computer Speech & Language 2019. [DOI: 10.1016/j.csl.2019.02.001]
20
Story BH, Vorperian HK, Bunton K, Durtschi RB. An age-dependent vocal tract model for males and females based on anatomic measurements. The Journal of the Acoustical Society of America 2018; 143:3079. [PMID: 29857736 PMCID: PMC5966313 DOI: 10.1121/1.5038264]
Abstract
The purpose of this study was to take a first step toward constructing a developmental and sex-specific version of a parametric vocal tract area function model representative of male and female vocal tracts ranging in age from infancy to 12 years, as well as adults. Anatomic measurements collected from a large imaging database of male and female children and adults provided the dataset from which length warping and cross-dimension scaling functions were derived and applied to the adult-based vocal tract model to project it backward along an age continuum. The resulting model was assessed qualitatively by projecting hypothetical vocal tract shapes onto midsagittal images from the cohort of children, and quantitatively by comparison of formant frequencies produced by the model to those reported in the literature. An additional validation of modeled vocal tract shapes was made possible by comparison to cross-sectional area measurements obtained for children and adults using acoustic pharyngometry. This initial attempt to generate a sex-specific developmental vocal tract model paves a path to study the relation of vocal tract dimensions to documented prepubertal acoustic differences.
Affiliation(s)
- Brad H Story
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85718, USA
- Houri K Vorperian
- Vocal Tract Development Lab, Waisman Center, University of Wisconsin-Madison, 1500 Highland Avenue # 429, Madison, Wisconsin 53705, USA
- Kate Bunton
- Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85718, USA
- Reid B Durtschi
- Vocal Tract Development Lab, Waisman Center, University of Wisconsin-Madison, 1500 Highland Avenue # 429, Madison, Wisconsin 53705, USA
21
Bunton K. Effects of nasal port area on perception of nasality and measures of nasalance based on computational modeling. The Cleft Palate-Craniofacial Journal 2018; 52:110-114. [PMID: 24437587 DOI: 10.1597/13-126]
Abstract
OBJECTIVE This study examined the relation between nasal port area, nasalance, and perceptual ratings of nasality for three English corner vowels, /i/, /u/, and /a/. DESIGN Samples were simulated using a computational model that allowed for exact control of nasal port size and direct measures of nasalance. Perceptual ratings were obtained using a paired stimulus presentation. PARTICIPANTS Four experienced listeners. MAIN OUTCOME MEASURES Nasalance and perceptual ratings of nasality. RESULTS Findings show that perceptual ratings of nasality and nasalance increased for samples generated with nasal port areas up to and including 0.16 cm² but plateaued in samples generated with larger nasal port areas. No vowel differences were noted for perceptual ratings. CONCLUSIONS This work extends previously published work by including nasal port areas representative of those reported in the literature for clinical populations. Continued work using samples with varied phonetic context and varying suprasegmental and temporal characteristics is needed.
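Nasalance, as measured here, is the standard ratio of nasal acoustic energy to total (nasal plus oral) acoustic energy. A minimal sketch of that computation follows; the RMS-based formulation and the synthetic two-channel signals are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def nasalance(nasal, oral):
    """Nasalance score: nasal RMS amplitude over nasal + oral RMS amplitude."""
    en = np.sqrt(np.mean(np.square(nasal)))
    eo = np.sqrt(np.mean(np.square(oral)))
    return en / (en + eo)

# Hypothetical two-channel recording of a sustained vowel
t = np.linspace(0.0, 0.2, 3200)
oral = np.sin(2 * np.pi * 220 * t)          # oral-channel signal
nasal = 0.25 * np.sin(2 * np.pi * 220 * t)  # weaker nasal-channel signal
score = nasalance(nasal, oral)              # 0.25 / (0.25 + 1.0) = 0.2
```

In the paper's computational model the two channels are available exactly, which is what permits "direct measures of nasalance" without the microphone-plate hardware used clinically.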
22
Elie B, Laprie Y. Simulating alveolar trills using a two-mass model of the tongue tip. The Journal of the Acoustical Society of America 2017; 142:3245. [PMID: 29195472 DOI: 10.1121/1.5012688]
Abstract
This paper investigates the possibility of reproducing the self-sustained oscillation of the tongue tip in alveolar trills. The aim is to study the articulatory and phonatory configurations that are required to produce alveolar trills. Using a realistic geometry of the vocal tract, derived from cineMRI data of a real speaker, the paper studies the mechanical behavior of a lumped two-mass model of the tongue tip. Then, the paper proposes a solution to simulate the incomplete occlusion of the vocal tract during linguopalatal contacts by adding a lateral acoustic waveguide. Finally, the simulation framework is used to study the impact of a set of parameters on the characteristic features of the produced alveolar trills. It shows that the production of trills is favored when the distance between the equilibrium position of the tongue tip and the hard palate in the alveolar zone is less than 1 mm, but without linguopalatal contact, and when the glottis is fully adducted.
Affiliation(s)
- Benjamin Elie
- Laboratoire Lorrain de Recherche en Informatique et ses Applications, Institut National de Recherche en Informatique et en Automatique/Centre National de la Recherche Scientifique/Université de Lorraine, Vandoeuvre-les-Nancy, France
- Yves Laprie
- Laboratoire Lorrain de Recherche en Informatique et ses Applications, Institut National de Recherche en Informatique et en Automatique/Centre National de la Recherche Scientifique/Université de Lorraine, Vandoeuvre-les-Nancy, France
23
Elie B, Laprie Y. Acoustic impact of the gradual glottal abduction degree on the production of fricatives: A numerical study. The Journal of the Acoustical Society of America 2017; 142:1303. [PMID: 28964087 DOI: 10.1121/1.5000232]
Abstract
The paper presents a numerical study of the acoustic impact of gradual glottal opening on the production of fricatives. Sustained fricatives are simulated by using classic lumped circuit element methods to compute the propagation of the acoustic wave along the vocal tract. A recent glottis model is connected to the wave solver to simulate a partial abduction of the vocal folds during their self-oscillating cycles. Area functions of fricatives at the three places of articulation of French have been extracted from static MRI acquisitions. Simulations highlight the existence of three distinct regimes, named A, B, and C, depending on the degree of abduction of the glottis. They are characterized by the frication noise level: A exhibits a low frication noise level; B, a transitional unstable regime, produces a mixed noise/voice signal; and C contains only frication noise. These regimes have significant impacts on the first spectral moments. Simulations show that their boundaries depend on articulatory and glottal configurations. The transitional regime B is shown to be unstable: it requires very specific configurations in comparison with the other regimes, and its acoustic features are very sensitive to small perturbations of the glottal abduction configuration.
Affiliation(s)
- Benjamin Elie
- Laboratoire Lorrain de Recherche en Informatique et ses Applications, l'Institut National de Recherche en Informatique et en Automatique, Centre National de la Recherche Scientifique, Université de Lorraine, Vandoeuvre-les-Nancy, France
- Yves Laprie
- Laboratoire Lorrain de Recherche en Informatique et ses Applications, l'Institut National de Recherche en Informatique et en Automatique, Centre National de la Recherche Scientifique, Université de Lorraine, Vandoeuvre-les-Nancy, France
24
Alzamendi GA, Schlotthauer G. Modeling and joint estimation of glottal source and vocal tract filter by state-space methods. Biomedical Signal Processing and Control 2017. [DOI: 10.1016/j.bspc.2016.12.022]
25
Story BH, Bunton K. An acoustically-driven vocal tract model for stop consonant production. Speech Communication 2017; 87:1-17. [PMID: 28093574 PMCID: PMC5234468 DOI: 10.1016/j.specom.2016.12.001]
Abstract
The purpose of this study was to further develop a multi-tier model of the vocal tract area function in which the modulations of shape to produce speech are generated by the product of a vowel substrate and a consonant superposition function. The new approach consists of specifying input parameters for a target consonant as a set of directional changes in the resonance frequencies of the vowel substrate. Using calculations of acoustic sensitivity functions, these "resonance deflection patterns" are transformed into time-varying deformations of the vocal tract shape without any direct specification of the location or extent of the consonant constriction along the vocal tract. The configurations of the constrictions and expansions that are generated by this process were shown to be physiologically realistic and to produce speech sounds that are easily identifiable as the target consonants. This model is a useful enhancement for area function-based synthesis and can serve as a tool for understanding how the vocal tract is shaped by a talker during speech production.
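The sensitivity-function mechanism behind such "resonance deflection" can be sketched numerically: compute a tube's resonances with a lossless plane-wave chain-matrix model, estimate the sensitivity of the first resonance (F1) to local area perturbations by finite differences, and then deflect the area function along that sensitivity to push F1 in a chosen direction. This is a schematic stand-in, not the authors' implementation; the uniform tube, its dimensions, and the lossless acoustics are assumptions:

```python
import math
import numpy as np

C = 35000.0      # speed of sound, cm/s
RHO = 0.00114    # air density, g/cm^3

def glottal_u(f, area, dx):
    """Chain lossless plane-wave tube sections from the lips (ideal open
    end: P = 0, U = 1) back to the glottis. Resonances of a tract closed
    at the glottis occur where the glottal volume velocity crosses zero."""
    k = 2.0 * math.pi * f / C
    p, u = 0.0, 1.0          # P = j*p stays imaginary, U = u stays real
    for A in area:           # area[0] is the section at the lips
        Z = RHO * C / A
        s, c = math.sin(k * dx), math.cos(k * dx)
        p, u = c * p + Z * s * u, c * u - (s / Z) * p
    return u

def formants(area, L, n=2):
    """First n resonance frequencies (Hz) of an area function (cm^2)."""
    dx = L / len(area)
    grid = np.arange(100.0, 4000.0, 4.0)
    vals = [glottal_u(f, area, dx) for f in grid]
    out = []
    for i in range(len(grid) - 1):
        if (vals[i] > 0) != (vals[i + 1] > 0):
            # linear interpolation for a sub-grid estimate of the crossing
            out.append(grid[i] - vals[i] * 4.0 / (vals[i + 1] - vals[i]))
    return np.array(out[:n])

def sensitivity(area, L, which=0, dA=0.01):
    """Finite-difference sensitivity of one resonance to local area changes."""
    base = formants(area, L)[which]
    S = np.empty(len(area))
    for i in range(len(area)):
        pert = area.copy()
        pert[i] += dA
        S[i] = (formants(pert, L)[which] - base) / dA
    return S

L = 17.5                          # vocal tract length, cm
area = np.full(20, 3.0)           # uniform 3 cm^2 tube
F1 = formants(area, L)[0]         # ~C/(4L) = 500 Hz for a uniform tube
S1 = sensitivity(area, L, which=0)
# Deflect the area function along the F1 sensitivity: F1 moves upward
# without ever specifying where the constriction/expansion should go.
deflected = area + 0.5 * S1 / np.abs(S1).max()
```

For the uniform tube, S1 is positive near the lips (a velocity antinode) and negative near the glottis (a pressure antinode), so the deflection widens the front and narrows the back, and F1 rises; a negative deflection would lower it, which is the sense in which a "resonance deflection pattern" implicitly selects a constriction location.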
26
Assaneo MF, Sitt J, Varoquaux G, Sigman M, Cohen L, Trevisan MA. Exploring the anatomical encoding of voice with a mathematical model of the vocal system. NeuroImage 2016; 141:31-39. [DOI: 10.1016/j.neuroimage.2016.07.033]
27
Bocquelet F, Hueber T, Girin L, Savariaux C, Yvert B. Real-Time Control of an Articulatory-Based Speech Synthesizer for Brain Computer Interfaces. PLoS Computational Biology 2016; 12:e1005119. [PMID: 27880768 PMCID: PMC5120792 DOI: 10.1371/journal.pcbi.1005119]
Abstract
Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the positions of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained, as assessed by a perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results pave the way for future speech BCI applications using such an articulatory-based speech synthesizer.
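The core of such a synthesizer is a learned frame-by-frame articulatory-to-acoustic mapping. As a schematic stand-in for the paper's DNN (the feature dimensions, the synthetic data, and the linear ridge-regression mapping are all illustrative assumptions, not the authors' architecture), the regression problem can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frames: 12-D articulatory features (e.g. x/y of 6 EMA sensors)
# mapped to 4-D acoustic parameters via regularized least squares.
n_frames, d_art, d_ac = 500, 12, 4
X = rng.normal(size=(n_frames, d_art))                     # articulatory frames
W_true = rng.normal(size=(d_art, d_ac))
Y = X @ W_true + 0.01 * rng.normal(size=(n_frames, d_ac))  # acoustic frames

lam = 1e-3   # ridge regularization
W = np.linalg.solve(X.T @ X + lam * np.eye(d_art), X.T @ Y)

mean_err = np.abs(X @ W - Y).mean()   # small: the mapping is recoverable
```

The paper replaces this linear map with a DNN trained on real EMA/speech pairs and evaluates it per incoming sensor frame, feeding the predicted acoustic parameters to a vocoder fast enough for closed-loop, real-time control.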
Affiliation(s)
- Florent Bocquelet
- INSERM, BrainTech Laboratory U1205, Grenoble, France
- Univ. Grenoble Alpes, BrainTech Laboratory U1205, Grenoble, France
- CNRS, GIPSA-Lab, Saint-Martin-d'Hères, France
- Univ. Grenoble Alpes, GIPSA-Lab, Saint-Martin-d'Hères, France
- Thomas Hueber
- CNRS, GIPSA-Lab, Saint-Martin-d'Hères, France
- Univ. Grenoble Alpes, GIPSA-Lab, Saint-Martin-d'Hères, France
- Laurent Girin
- Univ. Grenoble Alpes, GIPSA-Lab, Saint-Martin-d'Hères, France
- INRIA Grenoble Rhône-Alpes, Montbonnot, France
- Christophe Savariaux
- CNRS, GIPSA-Lab, Saint-Martin-d'Hères, France
- Univ. Grenoble Alpes, GIPSA-Lab, Saint-Martin-d'Hères, France
- Blaise Yvert
- INSERM, BrainTech Laboratory U1205, Grenoble, France
- Univ. Grenoble Alpes, BrainTech Laboratory U1205, Grenoble, France
28
Abstract
OBJECTIVE The goal of the Arizona Child Acoustic Database project was to obtain a large set of acoustic recordings, primarily vowels, collected from a cohort of children over a critical period of growth and development. METHOD Data were recorded longitudinally from 63 children between the ages of 2;0 and 7;0 at 3-month intervals. The protocol included individual American English vowels and diphthongs, nonsense multi-vowel transitions, word-level multi-vowel sequences (e.g., Hawaii), single-syllable words targeting each American English vowel, short sentences, and conversation. RESULTS Acoustic files are available for download through the University of Arizona Library Repository for use in future research projects. CONCLUSION Longitudinal recordings may be of interest because they allow tracking of acoustic characteristics produced by an individual child during a period of rapid growth and speech development.
Affiliation(s)
- Kate Bunton
- Department of Speech, Language, and Hearing Sciences, University of Arizona, Tucson, AZ, USA
29
Arnela M, Dabbaghchian S, Blandin R, Guasch O, Engwall O, Van Hirtum A, Pelorson X. Influence of vocal tract geometry simplifications on the numerical simulation of vowel sounds. The Journal of the Acoustical Society of America 2016; 140:1707. [PMID: 27914393 DOI: 10.1121/1.4962488]
Abstract
For many years, the vocal tract shape has been approximated by one-dimensional (1D) area functions to study the production of voice. More recently, 3D approaches have allowed one to deal with the complex 3D vocal tract, although area-based 3D geometries of circular cross-section are still in use. However, little is known about the influence of performing such a simplification, and some alternatives may exist between these two extreme options. To this end, several vocal tract geometry simplifications for the vowels [ɑ], [i], and [u] are investigated in this work. Six cases are considered, consisting of realistic, elliptical, and circular cross-sections interpolated through a bent or straight midline. For frequencies below 4-5 kHz, the influence of bending and cross-sectional shape has been found to be weak, while above these values simplified bent vocal tracts with realistic cross-sections are necessary to correctly emulate higher-order mode propagation. To perform this study, the finite element method (FEM) has been used. FEM results have also been compared to a 3D multimodal method and to a classical 1D frequency domain model.
Affiliation(s)
- Marc Arnela
- GTM-Grup de recerca en Tecnologies Mèdia, La Salle, Universitat Ramon Llull, C/Quatre Camins 30, Barcelona, E-08022, Catalonia, Spain
- Saeed Dabbaghchian
- Department of Speech, Music and Hearing, School of Computer Science & Communication, KTH Royal Institute of Technology, Stockholm, Sweden
- Rémi Blandin
- GIPSA-lab, Unité Mixte de Recherche au Centre National de la Recherche Scientifique 5216, Grenoble Campus, St. Martin d'Heres, F-38402, France
- Oriol Guasch
- GTM-Grup de recerca en Tecnologies Mèdia, La Salle, Universitat Ramon Llull, C/Quatre Camins 30, Barcelona, E-08022, Catalonia, Spain
- Olov Engwall
- Department of Speech, Music and Hearing, School of Computer Science & Communication, KTH Royal Institute of Technology, Stockholm, Sweden
- Annemie Van Hirtum
- GIPSA-lab, Unité Mixte de Recherche au Centre National de la Recherche Scientifique 5216, Grenoble Campus, St. Martin d'Heres, F-38402, France
- Xavier Pelorson
- GIPSA-lab, Unité Mixte de Recherche au Centre National de la Recherche Scientifique 5216, Grenoble Campus, St. Martin d'Heres, F-38402, France
30
Story BH, Bunton K. Formant measurement in children's speech based on spectral filtering. Speech Communication 2016; 76:93-111. [PMID: 26855461 PMCID: PMC4743040 DOI: 10.1016/j.specom.2015.11.001]
Abstract
Children's speech presents a challenging problem for formant frequency measurement. In part, this is because the high fundamental frequencies typical of children's speech production generate widely spaced harmonic components that may undersample the spectral shape of the vocal tract transfer function. In addition, there is often a weakening of upper harmonic energy and a noise component due to glottal turbulence. The purpose of this study was to develop a formant measurement technique based on cepstral analysis that does not require modification of the cepstrum itself or transformation back to the spectral domain. Instead, a narrow-band spectrum is low-pass filtered with a cutoff point (i.e., cutoff "quefrency" in the terminology of cepstral analysis) to preserve only the spectral envelope. To test the method, speech representative of a 2-3 year-old child was simulated with an airway modulation model of speech production. The model, which includes physiologically-scaled vocal folds and vocal tract, generates sound output analogous to a microphone signal. The vocal tract resonance frequencies can be calculated independently of the output signal and thus provide test cases that allow for assessing the accuracy of the formant tracking algorithm. When applied to the simulated child-like speech, the spectral filtering approach was shown to provide a clear spectrographic representation of formant change over the time course of the signal, and facilitates tracking of formant frequencies for further analysis.
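The spectral-filtering idea can be sketched as follows: synthesize a high-f0, two-formant frame, take its narrow-band log-magnitude spectrum, and low-pass filter it along the frequency axis with a cutoff quefrency below 1/f0 so that the harmonic ripple is removed and only the envelope survives. This is a schematic reconstruction, not the authors' code; f0, the formant frequencies and bandwidths, and the Butterworth filter choice are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks, lfilter

fs, n = 10000, 4096
f0 = 300.0                      # high, child-like fundamental frequency
src = np.zeros(n)
src[::int(fs // f0)] = 1.0      # glottal impulse train

def formant_filter(x, fc, bw):
    """Two-pole resonator standing in for one formant."""
    r = np.exp(-np.pi * bw / fs)
    th = 2 * np.pi * fc / fs
    return lfilter([1 - r], [1, -2 * r * np.cos(th), r * r], x)

# Simulated frame with formants near 1200 and 3000 Hz
sig = formant_filter(formant_filter(src, 1200, 120), 3000, 200)

# Narrow-band log-magnitude spectrum of one Hann-windowed frame
spec = 20 * np.log10(np.abs(np.fft.rfft(sig * np.hanning(n))) + 1e-9)
freqs = np.fft.rfftfreq(n, 1 / fs)

# Low-pass filter ALONG THE FREQUENCY AXIS: harmonic ripple repeats every
# f0 Hz (quefrency 1/f0), so a cutoff quefrency below 1/f0 keeps only the
# envelope, without transforming back and forth through the cepstrum.
df = freqs[1] - freqs[0]            # spectral sample spacing, Hz
q_cut = 0.5 / f0                    # cutoff quefrency, s
b, a = butter(4, 2 * q_cut * df)    # Wn normalized to Nyquist = 1/(2*df)
envelope = filtfilt(b, a, spec)

# Envelope peaks approximate the formants despite the sparse harmonics
peaks, _ = find_peaks(envelope, prominence=5)
est_formants = freqs[peaks]
```

Applying the same filtering to successive frames yields the smoothed spectrographic representation described in the abstract, with envelope peaks that can be tracked over time.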
Affiliation(s)
- Brad H. Story
- Speech Acoustics Laboratory, Department of Speech, Language, and Hearing Sciences, University of Arizona, P.O. Box 210071, Tucson, AZ 85721
- Kate Bunton
- Speech Acoustics Laboratory, Department of Speech, Language, and Hearing Sciences, University of Arizona, P.O. Box 210071, Tucson, AZ 85721
31
Lester RA, Story BH. The effects of physiological adjustments on the perceptual and acoustical characteristics of simulated laryngeal vocal tremor. The Journal of the Acoustical Society of America 2015; 138:953-963. [PMID: 26328711 PMCID: PMC4545074 DOI: 10.1121/1.4927561]
Abstract
The purpose of this study was to determine if adjustments to the voice source [i.e., fundamental frequency (F0), degree of vocal fold adduction] or vocal tract filter (i.e., vocal tract shape for vowels) reduce the perception of simulated laryngeal vocal tremor and to determine if listener perception could be explained by characteristics of the acoustical modulations. This research was carried out using a computational model of speech production that allowed for precise control and manipulation of the glottal and vocal tract configurations. Forty-two healthy adults participated in a perceptual study involving pair-comparisons of the magnitude of "shakiness" with simulated samples of laryngeal vocal tremor. Results revealed that listeners perceived a higher magnitude of voice modulation when simulated samples had a higher mean F0, greater degree of vocal fold adduction, and vocal tract shape for /i/ vs /ɑ/. However, the effect of F0 was significant only when glottal noise was not present in the acoustic signal. Acoustical analyses were performed with the simulated samples to determine the features that affected listeners' judgments. Based on regression analyses, listeners' judgments were predicted to some extent by modulation information present in both low and high frequency bands.
Affiliation(s)
- Rosemary A Lester
- Department of Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85721, USA
- Brad H Story
- Department of Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona 85721, USA
32
Carbonell KM, Lester RA, Story BH, Lotto AJ. Discriminating simulated vocal tremor source using amplitude modulation spectra. Journal of Voice 2014; 29:140-147. [PMID: 25532813 DOI: 10.1016/j.jvoice.2014.07.020]
Abstract
OBJECTIVES/HYPOTHESIS Sources of vocal tremor are difficult to categorize perceptually and acoustically. This article describes a preliminary attempt to discriminate vocal tremor sources through the use of spectral measures of the amplitude envelope. The hypothesis is that different vocal tremor sources are associated with distinct patterns of acoustic amplitude modulations. STUDY DESIGN Statistical categorization methods (discriminant function analysis) were used to discriminate signals from simulated vocal tremor with different sources using only acoustic measures derived from the amplitude envelopes. METHODS Simulations of vocal tremor were created by modulating parameters of a vocal fold model corresponding to oscillations of respiratory driving pressure (respiratory tremor), degree of vocal fold adduction (adductory tremor), and fundamental frequency of vocal fold vibration (F0 tremor). The acoustic measures were based on spectral analyses of the amplitude envelope computed across the entire signal and within select frequency bands. RESULTS The signals could be categorized (with accuracy well above chance) in terms of the simulated tremor source using only measures of the amplitude envelope spectrum even when multiple sources of tremor were included. CONCLUSIONS These results supply initial support for an amplitude-envelope-based approach to identify the source of vocal tremor and provide further evidence for the rich information about talker characteristics present in the temporal structure of the amplitude envelope.
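The amplitude-envelope spectrum at the core of this approach can be sketched as: rectify the waveform, smooth it to obtain the envelope, remove the DC component, and take the spectrum of the result. This is a minimal NumPy illustration of the general idea, not the authors' pipeline; the moving-average smoother and its 20 ms span are my own assumptions:

```python
import numpy as np

def envelope_spectrum(signal, fs, smooth_ms=20.0):
    """Amplitude-envelope modulation spectrum: rectify, smooth,
    remove DC, then FFT of the envelope."""
    x = np.abs(np.asarray(signal, dtype=float))            # full-wave rectify
    win = int(fs * smooth_ms / 1000.0)
    env = np.convolve(x, np.ones(win) / win, mode="same")  # moving-average smooth
    env = env - env.mean()                                 # drop the DC component
    spec = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    return freqs, spec

# 100 Hz tone, amplitude-modulated at a tremor-like 4 Hz rate
fs = 2000
t = np.arange(0, 2.0, 1.0 / fs)
x = (1.0 + 0.4 * np.sin(2 * np.pi * 4.0 * t)) * np.sin(2 * np.pi * 100.0 * t)
freqs, spec = envelope_spectrum(x, fs)
peak_hz = freqs[np.argmax(spec)]
```

The dominant peak of the envelope spectrum falls at the imposed modulation rate; measures derived from such spectra (within select frequency bands) are the kind of features the discriminant analysis operates on.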
Affiliation(s)
- Kathy M Carbonell, Department of Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona
- Rosemary A Lester, Department of Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona
- Brad H Story, Department of Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona
- Andrew J Lotto, Department of Speech, Language, and Hearing Sciences, University of Arizona, Tucson, Arizona
33
Alku P, Pohjalainen J, Vainio M, Laukkanen AM, Story BH. Formant frequency estimation of high-pitched vowels using weighted linear prediction. J Acoust Soc Am 2013; 134:1295-313. [PMID: 23927127 DOI: 10.1121/1.4812756] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/21/2023]
Abstract
All-pole modeling is a widely used formant estimation method, but its performance is known to deteriorate for high-pitched voices. To address this problem, several all-pole modeling methods robust to high fundamental frequency have been proposed. This study compares five such previously known methods and introduces a new technique, Weighted Linear Prediction with Attenuated Main Excitation (WLP-AME). WLP-AME utilizes temporally weighted linear prediction (LP) in which the square of the prediction error is multiplied by a given parametric weighting function. The weighting downgrades the contribution of the main excitation of the vocal tract in optimizing the filter coefficients. Consequently, the resulting all-pole model is affected more by the characteristics of the vocal tract, leading to less biased formant estimates. Using synthetic vowels created with a physical modeling approach, the results showed that WLP-AME yields improved formant frequencies for high-pitched sounds in comparison to the previously known methods (e.g., the relative error in the first formant of the vowel [a] decreased from 11% to 3% when conventional LP was replaced with WLP-AME). Experiments conducted on natural vowels indicate that the formants detected by WLP-AME changed in a more regular manner between repetitions at different pitches than those computed by conventional LP.
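The temporally weighted LP described above amounts to minimizing a per-sample-weighted squared prediction error. The sketch below solves the generic weighted normal equations in NumPy; it is an illustration of weighted LP in general, not the authors' WLP-AME implementation, whose specific attenuated-main-excitation weighting function is defined in the paper (uniform weights reduce the sketch to conventional covariance-method LP):

```python
import numpy as np

def weighted_lp(x, order, weights=None):
    """Least-squares linear prediction with an optional per-sample
    weight on the squared prediction error (plain LP if weights=None)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if weights is None:
        weights = np.ones(n - order)
    # Row t of the prediction matrix holds x[t-1] ... x[t-order].
    X = np.column_stack([x[order - k - 1 : n - k - 1] for k in range(order)])
    y = x[order:]
    w = np.asarray(weights, dtype=float)
    # Weighted normal equations: (X^T W X) a = X^T W y
    XtW = X.T * w
    a = np.linalg.solve(XtW @ X, XtW @ y)
    return a  # predictor coefficients a_1 ... a_p

# Sanity check: recover AR(2) coefficients from a noiseless synthetic signal
x = np.zeros(300)
x[0], x[1] = 1.0, 0.5
for t in range(2, 300):
    x[t] = 1.5 * x[t - 1] - 0.9 * x[t - 2]
a = weighted_lp(x, order=2)
```

In WLP-AME the weights are chosen to de-emphasize samples near the main glottal excitation, so the fitted all-pole model reflects the vocal tract resonances rather than the harmonic structure of the source.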
Affiliation(s)
- Paavo Alku, Department of Signal Processing and Acoustics, Aalto University, P.O. Box 13000, FI-00076 Aalto, Finland