26. Baucom BR, Sheng E, Christensen A, Georgiou PG, Narayanan SS, Atkins DC. Behaviorally-based couple therapies reduce emotional arousal during couple conflict. Behav Res Ther 2015;72:49-55. PMID: 26183021. DOI: 10.1016/j.brat.2015.06.015.
Abstract
Emotional arousal during relationship conflict is a major target for intervention in couple therapies. The current study examines changes in conflict-related emotional arousal in 104 couples who participated in a randomized clinical trial of two behaviorally-based couple therapies: traditional behavioral couple therapy (TBCT) and integrative behavioral couple therapy (IBCT). Emotional arousal is measured using the mean fundamental frequency of spouses' speech, and changes in emotional arousal from pre- to post-therapy are examined using multilevel models. Overall emotional arousal, the rate of increase in emotional arousal at the beginning of conflict, and the duration of emotional arousal declined for all couples. Reductions in overall arousal were stronger for TBCT wives than for IBCT wives but not significantly different for IBCT and TBCT husbands. Reductions in the rate of initial arousal were larger for TBCT couples than for IBCT couples, whereas reductions in duration were larger for IBCT couples. These findings suggest that both therapies can reduce emotional arousal, but that the two therapies create different kinds of change in it.
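
A minimal sketch of the arousal measure used above, assuming librosa is available and a hypothetical `wav_path` pointing at one spouse's speech turn; it computes mean fundamental frequency over voiced frames only. The study's multilevel models of pre-to-post change are not reproduced here.

```python
import numpy as np
import librosa

def mean_f0(wav_path, fmin=75.0, fmax=400.0):
    """Mean fundamental frequency (Hz) over voiced frames of one speech turn."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    return float(np.nanmean(f0[voiced_flag]))  # ignore unvoiced (NaN) frames
```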

27. Guha T, Yang Z, Ramakrishna A, Grossman RB, Hedley D, Lee S, Narayanan SS. On quantifying facial expression-related atypicality of children with autism spectrum disorder. Proc IEEE Int Conf Acoust Speech Signal Process (ICASSP) 2015;2015:803-807. PMID: 26705397. DOI: 10.1109/icassp.2015.7178080.
Abstract
Children with Autism Spectrum Disorder (ASD) are known to have difficulty in producing and perceiving emotional facial expressions, and their expressions are often perceived as atypical by adult observers. This paper focuses on data-driven ways to analyze and quantify atypicality in the facial expressions of children with ASD. Our objective is to uncover the characteristics of facial gestures that induce the sense of perceived atypicality in observers. Using a carefully collected motion capture database, facial expressions of children with and without ASD are compared within six basic emotion categories using methods from information theory, time-series modeling, and statistical analysis. Our experiments show that children with ASD usually have less complex expression-producing mechanisms, and that the differences in facial dynamics between children with and without ASD come primarily from the eye region. Our study also notes that children with ASD exhibit lower symmetry between the left and right facial regions and lower variation in motion intensity across facial regions.
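
The paper's information-theoretic measures are not reproduced here; the sketch below illustrates two simpler, assumed proxies for its symmetry and motion-intensity findings, given hypothetical motion-capture arrays of mirrored left/right facial markers.

```python
import numpy as np

def lr_symmetry(left, right):
    """left, right: (T, M, 3) trajectories of M mirrored left/right markers.
    Mean correlation of per-marker motion magnitude between the two sides."""
    vl = np.linalg.norm(np.diff(left, axis=0), axis=2)   # (T-1, M) speeds
    vr = np.linalg.norm(np.diff(right, axis=0), axis=2)
    return float(np.mean([np.corrcoef(vl[:, m], vr[:, m])[0, 1]
                          for m in range(vl.shape[1])]))

def motion_intensity_variation(markers):
    """markers: (T, M, 3). Coefficient of variation of mean motion intensity
    across markers (a crude proxy for variation across facial regions)."""
    v = np.linalg.norm(np.diff(markers, axis=0), axis=2).mean(axis=0)  # (M,)
    return float(np.std(v) / (np.mean(v) + 1e-12))
```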

28. Kim J, Toutios A, Lee S, Narayanan SS. A kinematic study of critical and non-critical articulators in emotional speech production. J Acoust Soc Am 2015;137:1411-1429. PMID: 25786953. DOI: 10.1121/1.4908284.
Abstract
This study explores one aspect of the articulatory mechanism that underlies emotional speech production, namely, the behavior of linguistically critical and non-critical articulators in the encoding of emotional information. The hypothesis is that the possibly larger kinematic variability in the behavior of non-critical articulators reveals the underlying emotional expression goals more explicitly than that of the critical articulators, which are strictly controlled in service of achieving linguistic goals and therefore exhibit smaller kinematic variability. This hypothesis is examined by kinematic analysis of the movements of critical and non-critical speech articulators, gathered using electromagnetic articulography during spoken expressions of five categorical emotions. Analysis at the level of consonant-vowel-consonant segments reveals that the critical articulators for the consonants show more (less) peripheral articulations during production of the syllables for high (low) arousal emotions, while the articulatory positions of non-critical articulators show emotional variation that is less tied to the linguistic gestures. Analysis of individual phonetic targets shows that, overall, between- and within-emotion variability in articulatory positions is larger for non-critical cases than for critical cases. Finally, the results of simulation experiments suggest that the emotion-dependent postural variation of the non-critical articulators is significantly associated with the control of the critical articulators.
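
A minimal sketch of the kind of variability comparison described above, under the assumption that articulator positions at phonetic targets and their emotion labels are already available as arrays; all names are hypothetical.

```python
import numpy as np

def articulatory_variability(positions, emotions):
    """positions: (N,) articulator positions at phonetic targets;
    emotions: (N,) emotion labels. Returns (within, between) variability."""
    labels = np.unique(emotions)
    means = np.array([positions[emotions == e].mean() for e in labels])
    within = float(np.mean([positions[emotions == e].std() for e in labels]))
    between = float(means.std())  # spread of per-emotion mean postures
    return within, between
```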

29. Chaspari T, Tsiartas A, Stein LI, Cermak SA, Narayanan SS. Sparse representation of electrodermal activity with knowledge-driven dictionaries. IEEE Trans Biomed Eng 2015;62:960-971. PMID: 25494494. PMCID: PMC4362752. DOI: 10.1109/tbme.2014.2376960.
Abstract
Biometric sensors and portable devices are being increasingly embedded into our everyday life, creating the need for robust physiological models that efficiently represent, analyze, and interpret the acquired signals. We propose a knowledge-driven method to represent electrodermal activity (EDA), a psychophysiological signal linked to stress, affect, and cognitive processing. We build EDA-specific dictionaries that accurately model both the slowly varying tonic part and the signal fluctuations, called skin conductance responses (SCRs), and use greedy sparse representation techniques to decompose the signal into a small number of atoms from the dictionary. Quantitative evaluation of our method considers signal reconstruction, compression rate, and information retrieval measures that capture the ability of the model to incorporate the main signal characteristics, such as SCR occurrences. Compared to previous studies fitting a predetermined structure to the signal, results indicate that our approach provides benefits across all aforementioned criteria. This paper demonstrates the ability of appropriate dictionaries, along with sparse decomposition methods, to reliably represent EDA signals, and provides a foundation for automatic measurement of SCR characteristics and the extraction of meaningful EDA features.
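
The sketch below illustrates the general idea rather than the paper's actual dictionaries: a small dictionary of biexponential (Bateman-style) SCR shapes plus slow polynomial tonic atoms, decomposed with scikit-learn's orthogonal matching pursuit as the greedy sparse coder. The atom time constants and the fixed-onset simplification are assumptions.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def build_eda_dictionary(n=512, fs=16.0):
    """Columns: biexponential SCR shapes over an assumed rise/decay grid,
    plus low-order polynomial atoms for the slow tonic level."""
    t = np.arange(n) / fs
    atoms = []
    for tau_rise in (0.5, 0.75, 1.0):          # seconds (assumed grid)
        for tau_decay in (2.0, 4.0, 6.0):
            a = np.exp(-t / tau_decay) - np.exp(-t / tau_rise)
            atoms.append(a / np.linalg.norm(a))
    for d in range(3):                          # tonic drift atoms
        a = t ** d
        atoms.append(a / np.linalg.norm(a))
    return np.stack(atoms, axis=1)              # (n, n_atoms)

def sparse_code(eda_window, D, n_atoms=6):
    """Greedy sparse decomposition of one EDA window onto the dictionary."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_atoms).fit(D, eda_window)
    return omp.coef_, D @ omp.coef_             # coefficients, reconstruction
```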

30. Kim J, Kumar N, Tsiartas A, Li M, Narayanan SS. Automatic intelligibility classification of sentence-level pathological speech. Comput Speech Lang 2015;29:132-144. PMID: 25414544. DOI: 10.1016/j.csl.2014.02.001.
Abstract
Pathological speech usually refers to the condition of speech distortion resulting from atypicalities in voice and/or in the articulatory mechanisms owing to disease, illness, or other physical or biological insult to the production system. Automatic evaluation of speech intelligibility and quality could assist experts in diagnosis and treatment design, but the many sources and types of variability make it a challenging computational problem. In this work we propose novel sentence-level features to capture abnormal variation in the prosodic, voice quality, and pronunciation aspects of pathological speech. In addition, we propose a post-classification posterior smoothing scheme that refines the posterior of a test sample based on the posteriors of other test samples. Finally, we perform feature-level fusion and subsystem decision fusion to arrive at a final intelligibility decision. Performance is tested on two pathological speech datasets, the NKI CCRT Speech Corpus (advanced head and neck cancer) and the TORGO database (cerebral palsy or amyotrophic lateral sclerosis), by evaluating classification accuracy without overlap of subjects' data between training and test partitions. Results show that the feature sets of the voice quality, prosodic, and pronunciation subsystems each offer significant discriminating power for binary intelligibility classification. We observe that the proposed posterior smoothing in the acoustic space can further reduce classification errors. The smoothed posterior score fusion of subsystems shows the best classification performance (73.5% unweighted and 72.8% weighted average recall of the binary classes).
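
A hedged sketch of the posterior smoothing idea: each test sample's posterior is blended with the mean posterior of its nearest neighbors in the acoustic feature space. The blending weight `alpha` and the neighbor count `k` are hypothetical; the paper's exact scheme may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smooth_posteriors(feats, post, k=10, alpha=0.5):
    """feats: (N, D) acoustic features of test samples; post: (N,) posteriors.
    Blend each posterior with the mean posterior of its k nearest test neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(feats)
    _, idx = nn.kneighbors(feats)               # idx[:, 0] is the sample itself
    neighbor_mean = post[idx[:, 1:]].mean(axis=1)
    return (1 - alpha) * post + alpha * neighbor_mean
```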

31. Can D, Gibson J, Vaz C, Georgiou PG, Narayanan SS. Barista: a framework for concurrent speech processing by USC-SAIL. Proc IEEE Int Conf Acoust Speech Signal Process (ICASSP) 2014;2014:3306-3310. PMID: 27610047. DOI: 10.1109/icassp.2014.6854212.
Abstract
We present Barista, an open-source framework for concurrent speech processing based on the Kaldi speech recognition toolkit and the libcppa actor library. With Barista, we aim to provide an easy-to-use, extensible framework for constructing highly customizable concurrent (and/or distributed) networks for a variety of speech processing tasks. Each Barista network specifies a flow of data between simple actors, concurrent entities that communicate by message passing, modeled after Kaldi tools. Leveraging the fast and reliable concurrency and distribution mechanisms provided by libcppa, Barista lets demanding speech processing tasks, such as real-time speech recognizers and complex training workflows, be scheduled and executed on parallel (and/or distributed) hardware. Barista is released under the Apache License v2.0.
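
Barista itself is C++ built on Kaldi and libcppa; purely to illustrate the actor pattern it describes (message-passing entities wired into a processing network), here is a toy Python analogue using threads and queues. All names in it are invented.

```python
import queue
import threading

class Actor(threading.Thread):
    """Toy actor: consume messages from an inbox, forward results downstream."""
    def __init__(self, handler, downstream=None):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.handler = handler
        self.downstream = downstream

    def run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:                      # poison pill: shut down the chain
                if self.downstream:
                    self.downstream.inbox.put(None)
                break
            out = self.handler(msg)
            if self.downstream:
                self.downstream.inbox.put(out)

# hypothetical three-stage network: frontend -> decoder -> sink
sink = Actor(lambda m: print('decoded:', m))
decoder = Actor(lambda feats: f'<hyp for {feats}>', downstream=sink)
frontend = Actor(lambda audio: f'feats({audio})', downstream=decoder)
for a in (sink, decoder, frontend):
    a.start()
for chunk in ('chunk0', 'chunk1'):
    frontend.inbox.put(chunk)
frontend.inbox.put(None)
for a in (frontend, decoder, sink):
    a.join()
```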

32. Lee CC, Katsamanis A, Black MP, Baucom BR, Christensen A, Georgiou PG, Narayanan SS. Computing vocal entrainment: a signal-derived PCA-based quantification scheme with application to affect analysis in married couple interactions. Comput Speech Lang 2014. DOI: 10.1016/j.csl.2012.06.006.

33. Bone D, Li M, Black MP, Narayanan SS. Intoxicated speech detection: a fusion framework with speaker-normalized hierarchical functionals and GMM supervectors. Comput Speech Lang 2014;28. PMID: 24376305. DOI: 10.1016/j.csl.2012.09.004.
Abstract
Segmental and suprasegmental speech signal modulations offer information about paralinguistic content such as affect, age and gender, pathology, and speaker state. Speaker state encompasses medium-term, temporary physiological phenomena influenced by internal or external biochemical actions (e.g., sleepiness, alcohol intoxication). Perceptual and computational research indicates that detecting speaker state from speech is a challenging task. In this paper, we present a system constructed with multiple representations of prosodic and spectral features that provided the best result at the Intoxication Subchallenge of Interspeech 2011 on the Alcohol Language Corpus. We discuss the details of each classifier and show that fusion improves performance. We additionally address the question of how best to construct a speaker state detection system in terms of robust and practical marginalization of associated variability, such as through modeling speakers, utterance type, gender, and utterance length. As is the case in human perception, speaker normalization provides significant improvements to our system. We show that a held-out set of baseline (sober) data can be used to achieve gains comparable to other speaker normalization techniques. Our fused frame-level statistic-functional systems, fused GMM systems, and final combined system achieve unweighted average recalls (UARs) of 69.7%, 65.1%, and 68.8%, respectively, on the test set. Results more consistent with the development set are obtained with matched-prompt training, where the UARs are 70.4%, 66.2%, and 71.4%, respectively. The combined system improves over the Challenge baseline by 5.5% absolute (8.4% relative), also improving upon our previous best result.
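
A minimal sketch of the baseline-based speaker normalization described above: each speaker's features are z-normalized using statistics computed only from that speaker's held-out sober recordings. Array names are hypothetical.

```python
import numpy as np

def sober_baseline_norm(features, speakers, is_baseline):
    """z-normalize each speaker's features using statistics computed only
    from that speaker's held-out sober (baseline) recordings."""
    out = np.empty_like(features, dtype=float)
    for s in np.unique(speakers):
        sel = speakers == s
        base = features[sel & is_baseline]
        mu, sd = base.mean(axis=0), base.std(axis=0) + 1e-8
        out[sel] = (features[sel] - mu) / sd
    return out
```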

34. Zu Y, Narayanan SS, Kim YC, Nayak K, Bronson-Lowe C, Villegas B, Ouyoung M, Sinha UK. Evaluation of swallow function after tongue cancer treatment using real-time magnetic resonance imaging: a pilot study. JAMA Otolaryngol Head Neck Surg 2013;139:1312-1319. PMID: 24177574. DOI: 10.1001/jamaoto.2013.5444.
Abstract
IMPORTANCE Magnetic resonance imaging (MRI) has the advantage of imaging swallow function at any anatomical level without changing the patient's position, and it can provide more detailed information than the modified barium swallow, currently the gold standard of swallow evaluation.
OBJECTIVE To investigate the use of real-time MRI in the evaluation of swallow function of patients with tongue cancer.
DESIGN, SETTING, AND PARTICIPANTS Real-time MRI experiments were performed on a Signa Excite HD 1.5-T scanner (GE Healthcare), with gradients capable of 40-mT/m (milli-Tesla per meter) amplitudes and 150-mT/m/ms (mT/m per millisecond) slew rates. The sequence used was a spiral fast gradient echo sequence. Four men with base of tongue or oral tongue squamous cell carcinoma and 3 age-matched healthy men with normal swallowing participated in the experiment.
INTERVENTIONS Real-time MRI of the midsagittal plane was collected during swallowing. Coronal planes between the oral tongue and base of tongue and through the middle of the larynx were collected from 1 of the patients.
MAIN OUTCOMES AND MEASURES Oral transit time, pharyngeal transit time, submental muscle length change, and the change in distance between the hyoid bone and the anterior boundary of the thyroid cartilage were measured frame by frame during swallowing.
RESULTS All the measurable oral and pharyngeal transit times of the patients with cancer were significantly longer than those of the healthy participants. The changes in submental muscle length and in the hyoid-thyroid distance occurred in concert for all 60 normal swallows, whereas the pattern differed for each patient with cancer. The coronal view of the tongue and larynx revealed information that, to our knowledge, has not been previously reported.
CONCLUSIONS AND RELEVANCE This study has demonstrated the potential of real-time MRI to reveal critical information beyond the capacity of traditional videofluoroscopy. Further investigation is needed to establish the technique, procedure, and standard scope of applying MRI to evaluate the swallow function of patients with cancer in research and clinical practice.
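
To illustrate the frame-by-frame measurements named above, a small sketch assuming per-frame landmark coordinates have already been extracted from the rtMRI frames (the landmark tracking itself is not shown).

```python
import numpy as np

def hyoid_thyroid_distance(hyoid, thyroid):
    """hyoid, thyroid: (T, 2) per-frame landmark coordinates from rtMRI.
    Frame-by-frame Euclidean distance and its change relative to the first frame."""
    d = np.linalg.norm(hyoid - thyroid, axis=1)
    return d, d - d[0]

def transit_time(onset_frame, offset_frame, frame_rate):
    """Transit time in seconds between two annotated event frames."""
    return (offset_frame - onset_frame) / frame_rate
```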

35. Kim J, Lammert AC, Ghosh PK, Narayanan SS. Co-registration of speech production datasets from electromagnetic articulography and real-time magnetic resonance imaging. J Acoust Soc Am 2014;135:EL115-EL121. PMID: 25234914. PMCID: PMC3985906. DOI: 10.1121/1.4862880.
Abstract
This paper describes a spatio-temporal registration approach for speech articulation data obtained from electromagnetic articulography (EMA) and real-time Magnetic Resonance Imaging (rtMRI). This is motivated by the potential for combining the complementary advantages of both types of data. The registration method is validated on EMA and rtMRI datasets obtained at different times, but using the same stimuli. The aligned corpus offers the advantages of high temporal resolution (from EMA) and a complete mid-sagittal view (from rtMRI). The co-registration also yields optimum placement of EMA sensors as articulatory landmarks on the magnetic resonance images, thus providing richer spatio-temporal information about articulatory dynamics.
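
A rough sketch of the two ingredients such a co-registration needs, spatial alignment and temporal alignment, using a Procrustes fit over matched landmarks and a plain DTW over feature sequences. This is an assumed decomposition for illustration, not the paper's algorithm.

```python
import numpy as np
from scipy.spatial import procrustes

def spatial_register(mri_pts, ema_pts):
    """Matched (M, 2) landmark sets (e.g., palate trace points). Procrustes
    gives the similarity transform aligning EMA space to the MRI midsagittal plane."""
    mri_std, ema_std, disparity = procrustes(mri_pts, ema_pts)
    return ema_std, disparity

def dtw_path(X, Y):
    """Plain DTW over feature sequences X: (N, D), Y: (M, D); returns the path."""
    N, M = len(X), len(Y)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            c = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    i, j, path = N, M, [(N - 1, M - 1)]
    while (i, j) != (1, 1):
        moves = []
        if i > 1 and j > 1: moves.append((D[i - 1, j - 1], i - 1, j - 1))
        if i > 1: moves.append((D[i - 1, j], i - 1, j))
        if j > 1: moves.append((D[i, j - 1], i, j - 1))
        _, i, j = min(moves)
        path.append((i - 1, j - 1))
    return path[::-1]
```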

36. Ramanarayanan V, Goldstein L, Narayanan SS. Spatio-temporal articulatory movement primitives during speech production: extraction, interpretation, and validation. J Acoust Soc Am 2013;134:1378-1394. PMID: 23927134. PMCID: PMC3745549. DOI: 10.1121/1.4812765.
Abstract
This paper presents a computational approach to derive interpretable movement primitives from speech articulation data. It puts forth a convolutive Nonnegative Matrix Factorization algorithm with sparseness constraints (cNMFsc) to decompose a given data matrix into a set of spatiotemporal basis sequences and an activation matrix. The algorithm optimizes a cost function that trades off the mismatch between the proposed model and the input data against the number of primitives that are active at any given instant. The method is applied both to measured articulatory data obtained through electromagnetic articulography and to synthetic data generated using an articulatory synthesizer. The paper then describes how to evaluate the algorithm's performance quantitatively and further performs a qualitative assessment of its ability to recover compositional structure from data. This is done using pseudo ground-truth primitives generated by the articulatory synthesizer based on an Articulatory Phonology framework [Browman and Goldstein (1995). "Dynamics and articulatory phonology," in Mind as Motion: Explorations in the Dynamics of Cognition, edited by R. F. Port and T. van Gelder (MIT Press, Cambridge, MA), pp. 175-194]. The results suggest that the proposed algorithm extracts movement primitives from human speech production data that are linguistically interpretable. Such a framework might aid the understanding of longstanding issues in speech production such as motor control and coarticulation.
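
The cNMFsc algorithm itself uses Hoyer-style sparseness projections; as a simplified stand-in, the sketch below implements convolutive NMF with multiplicative (Euclidean) updates and a plain L1 penalty on the activations. The penalty weight and iteration count are assumptions.

```python
import numpy as np

def shift(M, t):
    """Shift columns of M right by t samples (zero padding); negative t shifts left."""
    out = np.zeros_like(M)
    if t == 0:
        out[:] = M
    elif t > 0:
        out[:, t:] = M[:, :-t]
    else:
        out[:, :t] = M[:, -t:]
    return out

def convolutive_nmf(X, K, T, n_iter=200, l1=0.1, eps=1e-9):
    """X: (F, N) nonnegative data; K primitives of temporal extent T.
    Returns W: (T, F, K) basis sequences and H: (K, N) activations."""
    rng = np.random.default_rng(0)
    F, N = X.shape
    W = rng.random((T, F, K))
    H = rng.random((K, N))
    for _ in range(n_iter):
        Xhat = sum(W[t] @ shift(H, t) for t in range(T)) + eps
        for t in range(T):                      # multiplicative W updates
            Ht = shift(H, t)
            W[t] *= (X @ Ht.T) / (Xhat @ Ht.T + eps)
        Xhat = sum(W[t] @ shift(H, t) for t in range(T)) + eps
        num = sum(W[t].T @ shift(X, -t) for t in range(T))
        den = sum(W[t].T @ shift(Xhat, -t) for t in range(T)) + l1 + eps
        H *= num / den                          # L1-penalized H update
    return W, H
```

Each reconstructed frame is the sum over lags t of W[t] applied to the activations shifted by t, which is what makes the learned primitives spatiotemporal rather than static basis vectors.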

37. Ghosh PK, Narayanan SS. On smoothing articulatory trajectories obtained from Gaussian mixture model based acoustic-to-articulatory inversion. J Acoust Soc Am 2013;134:EL258-EL264. PMID: 23927234. PMCID: PMC4109078. DOI: 10.1121/1.4813590.
Abstract
It is well known that the performance of acoustic-to-articulatory inversion improves by smoothing the articulatory trajectories estimated using Gaussian mixture model (GMM) mapping (denoted GMM + Smoothing). GMM + Smoothing also performs similarly to GMM mapping using dynamic features, which integrates smoothing directly into the mapping criterion. Because of the separation between smoothing and mapping, it remains unclear what objective criterion GMM + Smoothing optimizes. In this work a new integrated smoothness criterion, the smoothed GMM (SGMM), is proposed. GMM + Smoothing is shown, both analytically and experimentally, to be identical to the asymptotic solution of SGMM, suggesting that GMM + Smoothing is a near-optimal solution of SGMM.
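
A sketch of the GMM + Smoothing baseline discussed above (not the proposed SGMM): a joint acoustic-articulatory GMM, conditional-mean inversion, and a separate low-pass smoothing step. Dimensions, cutoff, and sampling rate are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(acoustic, articulatory, n_components=16):
    """Fit a GMM on joint [acoustic, articulatory] vectors (rows are frames)."""
    Z = np.hstack([acoustic, articulatory])
    return GaussianMixture(n_components=n_components, covariance_type='full').fit(Z)

def gmm_invert(gmm, acoustic, dx):
    """Conditional-mean mapping E[y | x] under the joint GMM; dx = acoustic dim."""
    mu_x, mu_y = gmm.means_[:, :dx], gmm.means_[:, dx:]
    Sxx = gmm.covariances_[:, :dx, :dx]
    Syx = gmm.covariances_[:, dx:, :dx]
    K, out = gmm.n_components, []
    for x in acoustic:
        logw = np.log(gmm.weights_) + np.array(
            [multivariate_normal.logpdf(x, mu_x[k], Sxx[k]) for k in range(K)])
        w = np.exp(logw - logw.max()); w /= w.sum()
        cond = np.array([mu_y[k] + Syx[k] @ np.linalg.solve(Sxx[k], x - mu_x[k])
                         for k in range(K)])
        out.append(w @ cond)
    return np.array(out)

def smooth_trajectories(traj, fs=100.0, cutoff=10.0):
    """Separate low-pass smoothing step of GMM + Smoothing (assumed cutoff)."""
    b, a = butter(4, cutoff / (fs / 2))
    return filtfilt(b, a, traj, axis=0)
```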

38. Ramanarayanan V, Goldstein L, Byrd D, Narayanan SS. An investigation of articulatory setting using real-time magnetic resonance imaging. J Acoust Soc Am 2013;134:510-519. PMID: 23862826. PMCID: PMC3724797. DOI: 10.1121/1.4807639.
Abstract
This paper presents an automatic procedure to analyze articulatory setting in speech production using real-time magnetic resonance imaging of the moving human vocal tract. The procedure extracts frames corresponding to inter-speech pauses, speech-ready intervals and absolute rest intervals from magnetic resonance imaging sequences of read and spontaneous speech elicited from five healthy speakers of American English and uses automatically extracted image features to quantify vocal tract posture during these intervals. Statistical analyses show significant differences between vocal tract postures adopted during inter-speech pauses and those at absolute rest before speech; the latter also exhibits a greater variability in the adopted postures. In addition, the articulatory settings adopted during inter-speech pauses in read and spontaneous speech are distinct. The results suggest that adopted vocal tract postures differ on average during rest positions, ready positions and inter-speech pauses, and might, in that order, involve an increasing degree of active control by the cognitive speech planning mechanism.
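
A minimal sketch of the comparison pipeline implied above, assuming posture feature vectors per frame and labeled time intervals are already available; Welch t-tests stand in for the paper's statistical analyses.

```python
import numpy as np
from scipy.stats import ttest_ind

def interval_means(frame_feats, frame_times, intervals):
    """frame_feats: (T, D) image-derived posture features; intervals: list of
    (t0, t1) in seconds. Mean posture feature vector within each interval."""
    rows = [frame_feats[(frame_times >= t0) & (frame_times < t1)].mean(axis=0)
            for t0, t1 in intervals
            if ((frame_times >= t0) & (frame_times < t1)).any()]
    return np.array(rows)

def compare_postures(isp, rest):
    """Per-feature Welch t-tests: inter-speech-pause vs absolute-rest postures."""
    return [ttest_ind(isp[:, d], rest[:, d], equal_var=False)
            for d in range(isp.shape[1])]
```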

39. Zhu Y, Kim YC, Proctor MI, Narayanan SS, Nayak KS. Dynamic 3-D visualization of vocal tract shaping during speech. IEEE Trans Med Imaging 2013;32:838-848. PMID: 23204279. PMCID: PMC3896513. DOI: 10.1109/tmi.2012.2230017.
Abstract
Noninvasive imaging is widely used in speech research as a means to investigate the shaping and dynamics of the vocal tract during speech production. 3-D dynamic MRI would be a major advance, as it would provide 3-D dynamic visualization of the entire vocal tract. We present a novel method for the creation of 3-D dynamic movies of vocal tract shaping based on the acquisition of 2-D dynamic data from parallel slices and temporal alignment of the image sequences using audio information. Multiple sagittal 2-D real-time movies with synchronized audio recordings are acquired for the English vowel-consonant-vowel stimuli /ala/, /aɹa/, /asa/, and /aʃa/. Audio data are aligned using mel-frequency cepstral coefficients (MFCCs) extracted from windowed intervals of the speech signal, and the sagittal image sequences acquired from all slices are then aligned using dynamic time warping (DTW). The aligned image sequences enable dynamic 3-D visualization: synthesized movies of the moving airway in the coronal planes, and visualization of desired tissue surfaces and the tube-shaped vocal tract airway after manual segmentation of targeted articulators and smoothing. The resulting volumes allow for dynamic 3-D visualization of salient aspects of lingual articulation, including the formation of tongue grooves and sublingual cavities, with a temporal resolution of 78 ms.
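
The audio-driven alignment step can be sketched directly with librosa: MFCCs from two repetitions are aligned by DTW, and the resulting path re-times one slice's image sequence. It assumes the MFCC hop is chosen so audio frames correspond one-to-one with image frames, which glosses over the paper's windowed-interval details; file names and rates are hypothetical.

```python
import numpy as np
import librosa

def align_repetitions(wav_a, wav_b, sr=20000, hop=200):
    """MFCC + DTW alignment of two repetitions of the same stimulus. Returns
    (frame_in_A, frame_in_B) pairs in forward time order; hop is chosen so
    audio frames match the image frame rate (sr / hop = 100 fps here)."""
    ya, _ = librosa.load(wav_a, sr=sr)
    yb, _ = librosa.load(wav_b, sr=sr)
    Xa = librosa.feature.mfcc(y=ya, sr=sr, n_mfcc=13, hop_length=hop)
    Xb = librosa.feature.mfcc(y=yb, sr=sr, n_mfcc=13, hop_length=hop)
    _, wp = librosa.sequence.dtw(X=Xa, Y=Xb, metric='euclidean')
    return wp[::-1]

def warp_image_sequence(frames_b, path, n_frames_a):
    """Resample slice-B image frames onto slice-A's timeline via the DTW path."""
    mapping = dict(path)       # keeps the last B-frame matched to each A-frame
    return np.stack([frames_b[mapping[i]] for i in range(n_frames_a)])
```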

40. Ettelaie E, Georgiou PG, Narayanan SS. Unsupervised data processing for classifier-based speech translator. Comput Speech Lang 2013. DOI: 10.1016/j.csl.2012.03.001.

41. Xiao B, Can D, Georgiou PG, Atkins D, Narayanan SS. Analyzing the language of therapist empathy in motivational interview based psychotherapy. Proc APSIPA Annu Summit Conf 2012;2012:6411762. PMID: 27602411. PMCID: PMC5010859.
Abstract
Empathy is an important aspect of social communication, especially in medical and psychotherapy applications, and measures of empathy can offer insights into the quality of therapy. We use an N-gram language model based maximum likelihood strategy to classify empathic versus non-empathic utterances and report the precision and recall of classification for various parameters. High recall is obtained with unigram features, while bigram features achieve the highest F1-score. Based on the utterance-level models, a group of lexical features is extracted at the therapy session level. The effectiveness of these features in modeling session-level annotator perceptions of empathy is evaluated through correlation with expert-coded session-level empathy scores. Our combined feature set achieved a correlation of 0.558 between predicted and expert-coded empathy scores. Results also suggest that longer-term empathy perception may be driven more by isolated, salient empathic events.
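
A self-contained sketch of the maximum-likelihood N-gram classification described above: class-conditional bigram models with add-one smoothing, with the label chosen by utterance log-likelihood. Tokenization, boundary symbols, and data structures are assumptions.

```python
import math
from collections import Counter

class NgramClassifier:
    """Class-conditional N-gram LMs with add-one smoothing; ML label choice."""
    def __init__(self, n=2):
        self.n, self.counts, self.context, self.vocab = n, {}, {}, set()

    def _ngrams(self, tokens):
        toks = ['<s>'] * (self.n - 1) + tokens + ['</s>']
        return [tuple(toks[i:i + self.n]) for i in range(len(toks) - self.n + 1)]

    def fit(self, utterances, labels):
        for toks, y in zip(utterances, labels):
            self.vocab.update(toks)
            cn = self.counts.setdefault(y, Counter())
            cc = self.context.setdefault(y, Counter())
            for g in self._ngrams(toks):
                cn[g] += 1
                cc[g[:-1]] += 1

    def log_likelihood(self, tokens, y):
        V = len(self.vocab) + 2                  # +2 for boundary symbols
        cn, cc = self.counts[y], self.context[y]
        return sum(math.log((cn[g] + 1) / (cc[g[:-1]] + V))
                   for g in self._ngrams(tokens))

    def predict(self, tokens):
        return max(self.counts, key=lambda y: self.log_likelihood(tokens, y))
```

Usage would look like `clf = NgramClassifier(n=2); clf.fit(tokenized_utterances, labels); clf.predict(tokens)`, with labels such as 'empathic' and 'non-empathic'.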

42. Kim YC, Proctor MI, Narayanan SS, Nayak KS. Improved imaging of lingual articulation using real-time multislice MRI. J Magn Reson Imaging 2012;35:943-948. PMID: 22127935. DOI: 10.1002/jmri.23510.
Abstract
PURPOSE To develop a real-time imaging technique that allows for simultaneous visualization of vocal tract shaping in multiple scan planes and provides dynamic visualization of complex articulatory features.
MATERIALS AND METHODS Simultaneous imaging of multiple slices was implemented using a custom real-time imaging platform. Midsagittal, coronal, and axial scan planes of the human upper airway were prescribed and imaged in real time using a fast spiral gradient-echo pulse sequence. Two native speakers of English produced the voiceless and voiced fricatives /f/-/v/, /θ/-/ð/, /s/-/z/, and /ʃ/-/ʒ/ in symmetrical, maximally contrastive vocalic contexts /a_a/, /i_i/, and /u_u/. Vocal tract videos were synchronized with noise-cancelled audio recordings, facilitating the selection of frames associated with production of English fricatives.
RESULTS Coronal slices intersecting the postalveolar region of the vocal tract revealed tongue grooving to be most pronounced during fricative production in back vowel contexts, and more pronounced for the sibilants /s/-/z/ than for /ʃ/-/ʒ/. The axial slice best revealed differences in dorsal and pharyngeal articulation; voiced fricatives were observed to be produced with a larger cross-sectional area in the pharyngeal airway. Partial saturation of spins provided accurate localization of the imaging planes with respect to each other.
CONCLUSION Real-time MRI of multiple intersecting slices can provide valuable spatial and temporal information about vocal tract shaping, including details not observable from a single slice.

43. Ghosh PK, Goldstein LM, Narayanan SS. Processing speech signal using auditory-like filterbank provides least uncertainty about articulatory gestures. J Acoust Soc Am 2011;129:4014-4022. PMID: 21682422. PMCID: PMC3135153. DOI: 10.1121/1.3573987.
Abstract
Understanding how the human speech production system is related to the human auditory system has been a perennial subject of inquiry. To investigate the production-perception link, in this paper, a computational analysis has been performed using the articulatory movement data obtained during speech production with concurrently recorded acoustic speech signals from multiple subjects in three different languages: English, Cantonese, and Georgian. The form of articulatory gestures during speech production varies across languages, and this variation is considered to be reflected in the articulatory position and kinematics. The auditory processing of the acoustic speech signal is modeled by a parametric representation of the cochlear filterbank which allows for realizing various candidate filterbank structures by changing the parameter value. Using mathematical communication theory, it is found that the uncertainty about the articulatory gestures in each language is maximally reduced when the acoustic speech signal is represented using the output of a filterbank similar to the empirically established cochlear filterbank in the human auditory system. Possible interpretations of this finding are discussed.
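
The paper's communication-theoretic analysis is not reproduced here; as an assumed stand-in, the sketch below compares filterbanks by how much mutual information (scikit-learn's nonparametric estimator) their band energies carry about one articulatory variable. Band counts, hop, and the mel-vs-linear contrast are assumptions.

```python
import numpy as np
import librosa
from sklearn.feature_selection import mutual_info_regression

def band_energies(y, sr, n_bands=24, warp='mel', hop=160):
    """Log band energies under an auditory-like (mel) or linear band layout."""
    if warp == 'mel':
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands, hop_length=hop)
    else:
        P = np.abs(librosa.stft(y, hop_length=hop)) ** 2
        edges = np.linspace(0, P.shape[0], n_bands + 1, dtype=int)
        S = np.stack([P[a:b].mean(axis=0) for a, b in zip(edges[:-1], edges[1:])])
    return np.log(S + 1e-10).T                   # (frames, bands)

def uncertainty_reduction(y, sr, artic, warp):
    """Total MI between band energies and one articulatory variable aligned
    frame-by-frame with the audio (alignment assumed done upstream)."""
    X = band_energies(y, sr, warp=warp)
    n = min(len(X), len(artic))
    return mutual_info_regression(X[:n], artic[:n]).sum()
```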

44. Lee CC, Katsamanis A, Black MP, Baucom BR, Georgiou PG, Narayanan SS. Affective state recognition in married couples' interactions using PCA-based vocal entrainment measures with multiple instance learning. Affective Computing and Intelligent Interaction 2011. DOI: 10.1007/978-3-642-24571-8_4.

45. Kim YC, Hayes CE, Narayanan SS, Nayak KS. Novel 16-channel receive coil array for accelerated upper airway MRI at 3 Tesla. Magn Reson Med 2011;65:1711-1717. PMID: 21590804. DOI: 10.1002/mrm.22742.
Abstract
Upper airway MRI can provide a noninvasive assessment of speech and swallowing disorders and of sleep apnea. Recent work has demonstrated the value of high-resolution three-dimensional imaging and dynamic two-dimensional imaging, and the importance of further improvements in spatio-temporal resolution. The purpose of this study was to describe a novel 16-channel 3 Tesla receive coil that is highly sensitive to the human upper airway and to investigate the performance of accelerated upper airway MRI with the coil. In three-dimensional imaging of the upper airway during a static posture, 6-fold acceleration is demonstrated using parallel imaging, potentially allowing the whole three-dimensional vocal tract to be captured at 1.25 mm isotropic resolution within 9 sec of sustained sound production. Midsagittal spiral parallel imaging of vocal tract dynamics during natural speech production is demonstrated with 2 × 2 mm(2) in-plane spatial and 84 ms temporal resolution.

46. Kim YC, Narayanan SS, Nayak KS. Flexible retrospective selection of temporal resolution in real-time speech MRI using a golden-ratio spiral view order. Magn Reson Med 2011;65:1365-1371. PMID: 21500262. DOI: 10.1002/mrm.22714.
Abstract
In speech production research using real-time magnetic resonance imaging (MRI), the analysis of articulatory dynamics is performed retrospectively. A flexible selection of temporal resolution is highly desirable because of natural variations in speech rate and variations in the speed of different articulators. The purpose of the study is to demonstrate a first application of golden-ratio spiral temporal view order to real-time speech MRI and investigate its performance by comparison with conventional bit-reversed temporal view order. Golden-ratio view order proved to be more effective at capturing the dynamics of rapid tongue tip motion. A method for automated blockwise selection of temporal resolution is presented that enables the synthesis of a single video from multiple temporal resolution videos and potentially facilitates subsequent vocal tract shape analysis.
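
The golden-ratio ordering itself is easy to sketch: successive spiral interleaves advance by 2π/φ, so any run of consecutive views covers azimuth nearly uniformly, and frames can be re-binned at arbitrary temporal resolution after acquisition. The uniformity check below is an assumed diagnostic, not the paper's evaluation.

```python
import numpy as np

PHI = (1 + np.sqrt(5)) / 2

def golden_ratio_angles(n_views):
    """Spiral interleaf rotation angles under a golden-ratio view order."""
    return np.mod(np.arange(n_views) * 2 * np.pi / PHI, 2 * np.pi)

def max_angular_gap(angles):
    """Largest azimuthal gap left by a set of views (coverage diagnostic)."""
    a = np.sort(np.mod(angles, 2 * np.pi))
    return np.diff(np.concatenate([a, [a[0] + 2 * np.pi]])).max()

# any window of consecutive views forms a frame, at a temporal resolution
# chosen retrospectively; coverage stays near-uniform for any window size
angles = golden_ratio_angles(1000)
for views_per_frame in (8, 13, 21):
    print(views_per_frame, max_angular_gap(angles[100:100 + views_per_frame]))
```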

47. Silva J, Narayanan SS. Information divergence estimation based on data-dependent partitions. J Stat Plan Inference 2010. DOI: 10.1016/j.jspi.2010.04.011.

48. Ghosh PK, Narayanan SS. Bark frequency transform using an arbitrary order allpass filter. IEEE Signal Process Lett 2010;17:543-546. PMID: 24436628. PMCID: PMC3891208. DOI: 10.1109/lsp.2010.2046192.
Abstract
We propose an arbitrary-order stable allpass filter structure for frequency transformation from the Hertz to the Bark scale. In the proposed structure, the first-order allpass filter is causal, but the second- and higher-order allpass filters are non-causal. We find that the accuracy of the transformation improves significantly when a second- or higher-order allpass filter is designed rather than a first-order one, and that the RMS error of the transformation decreases monotonically as the order of the allpass filter increases.
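
For the first-order (causal) case mentioned above, the classical allpass phase function gives the frequency warping in closed form; the sketch fits the allpass coefficient to a common published Bark approximation (Zwicker-style, an assumption) by grid search.

```python
import numpy as np

def bark(f_hz):
    """Zwicker-style Bark approximation (an assumed published form)."""
    return 13 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def allpass_warp(omega, rho):
    """Warped frequency through a first-order allpass with coefficient rho."""
    return omega + 2 * np.arctan(rho * np.sin(omega) / (1 - rho * np.cos(omega)))

def fit_rho(fs=16000, n=512):
    """Grid-search rho so the warping best matches Bark (scaled to [0, pi])."""
    f = np.linspace(1, fs / 2, n)
    omega = 2 * np.pi * f / fs
    target = bark(f) / bark(fs / 2) * np.pi
    rhos = np.linspace(0.0, 0.99, 199)
    errs = [np.sqrt(np.mean((allpass_warp(omega, r) - target) ** 2)) for r in rhos]
    return rhos[int(np.argmin(errs))]
```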

49. Ramanarayanan V, Bresch E, Byrd D, Goldstein L, Narayanan SS. Analysis of pausing behavior in spontaneous speech using real-time magnetic resonance imaging of articulation. J Acoust Soc Am 2009;126:EL160-EL165. PMID: 19894792. PMCID: PMC2776778. DOI: 10.1121/1.3213452.
Abstract
It is hypothesized that pauses at major syntactic boundaries (i.e., grammatical pauses), but not ungrammatical (e.g., word search) pauses, are planned by a high-level cognitive mechanism that also controls the rate of articulation around these junctures. Real-time magnetic resonance imaging is used to analyze articulation at and around grammatical and ungrammatical pauses in spontaneous speech. Measures quantifying the speed of articulators were developed and applied during these pauses as well as during their immediate neighborhoods. Grammatical pauses were found to have an appreciable drop in speed at the pause itself as compared to ungrammatical pauses, which is consistent with our hypothesis that grammatical pauses are indeed choreographed by a central cognitive planner.
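
A minimal sketch of an articulator-speed measure of the sort described above, assuming tracked contour points per rtMRI frame are available; the paper's specific measures may differ.

```python
import numpy as np

def articulator_speed(points, frame_rate):
    """points: (T, M, 2) tracked articulator contour points over rtMRI frames.
    Mean speed (pixels/s) per frame as a simple articulatory-activity measure."""
    disp = np.linalg.norm(np.diff(points, axis=0), axis=2)   # (T-1, M)
    return disp.mean(axis=1) * frame_rate
```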

50. Ghosh PK, Narayanan SS. Pitch contour stylization using an optimal piecewise polynomial approximation. IEEE Signal Process Lett 2009;16:810-813. PMID: 24453471. PMCID: PMC3895368. DOI: 10.1109/lsp.2009.2025824.
Abstract
We propose a dynamic programming (DP) based piecewise polynomial approximation of discrete data such that the L2 norm of the approximation error is minimized, and we apply this technique to the stylization of the speech pitch contour. Objective evaluation verifies that the DP-based technique indeed yields the minimum mean square error (MSE) compared with other approximation methods. Subjective evaluation reveals that the quality of speech synthesized using the stylized pitch contour obtained by the DP method is almost identical to that of the original speech.
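
The DP itself is compact enough to sketch: per-segment least-squares polynomial fits serve as local costs, and dynamic programming chooses the segment boundaries that minimize total squared error. Segment count, degree, and the unoptimized O(K N^2) search are assumptions.

```python
import numpy as np

def stylize_pitch(f0, n_segments, degree=2):
    """Optimal piecewise-polynomial segmentation of a pitch contour by DP,
    minimizing total squared error; O(K * N^2) fits, fine for short contours."""
    N, t = len(f0), np.arange(len(f0))

    def seg_cost(i, j):                          # LS error over samples i..j-1
        if j - i < degree + 2:
            return 0.0                           # exact fit with few points
        c = np.polyfit(t[i:j], f0[i:j], degree)
        return float(np.sum((np.polyval(c, t[i:j]) - f0[i:j]) ** 2))

    cost = np.full((n_segments + 1, N + 1), np.inf)
    back = np.zeros((n_segments + 1, N + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, N + 1):
            for i in range(k - 1, j):
                c = cost[k - 1, i] + seg_cost(i, j)
                if c < cost[k, j]:
                    cost[k, j], back[k, j] = c, i
    bounds, j = [], N
    for k in range(n_segments, 0, -1):
        i = back[k, j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]                          # half-open (start, end) pairs
```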