1
Lozano A, Nava E, García Méndez MD, Moreno-Torres I. Computing nasalance with MFCCs and Convolutional Neural Networks. PLoS One 2024;19:e0315452. PMID: 39739659; DOI: 10.1371/journal.pone.0315452.
Abstract
Nasalance is a valuable clinical biomarker for hypernasality. It is computed as the ratio of acoustic energy emitted through the nose to the total energy emitted through the mouth and nose (eNasalance). A new approach is proposed that computes nasalance using Convolutional Neural Networks (CNNs) trained with Mel-Frequency Cepstral Coefficients (mfccNasalance). mfccNasalance is evaluated by examining its accuracy: 1) when the training and test data are from the same or from different dialects; 2) with test data that differs in dynamicity (e.g. rapidly produced diadochokinetic syllables versus short words); and 3) under multiple CNN configurations (i.e. kernel shape and use of 1 × 1 pointwise convolution). Dual-channel Nasometer speech data were recorded from healthy speakers of different dialects: Costa Rica (more (+) nasal) and Spain and Chile (less (-) nasal). The input to the CNN models consisted of sequences of 39 MFCC vectors computed over 250 ms moving windows. The test data were recorded in Spain and included short words (-dynamic), sentences (+dynamic), and diadochokinetic syllables (+dynamic). The accuracy of a CNN model was defined as the Spearman correlation between the mfccNasalance scores for that model and the perceptual nasality scores of human experts. In the same-dialect condition, mfccNasalance was more accurate than eNasalance independently of the CNN configuration; using a 1 × 1 kernel increased accuracy for +dynamic utterances (p < .001), though not for -dynamic utterances. The kernel shape had a significant impact for -dynamic utterances (p < .001) exclusively. In the different-dialect condition, the scores were significantly less accurate than in the same-dialect condition, particularly for the Costa Rica-trained models. We conclude that mfccNasalance is a flexible and useful alternative to eNasalance. Future studies should explore how to optimize mfccNasalance by selecting the most adequate CNN model as a function of the dynamicity of the target speech data.
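For orientation, below is a minimal NumPy sketch of the classical energy-based nasalance (eNasalance) that mfccNasalance is compared against; the frame length, silent-frame handling, and function name are illustrative assumptions rather than the authors' implementation. (In the proposed approach, the CNN input is instead a sequence of 39-coefficient MFCC vectors computed over 250 ms moving windows.)

```python
import numpy as np

def e_nasalance(nasal: np.ndarray, oral: np.ndarray,
                sr: int = 22050, frame_ms: float = 25.0) -> float:
    """Energy-based nasalance: nasal energy / (nasal + oral energy).

    `nasal` and `oral` are the two Nasometer channels, both sampled at `sr`.
    Returns a value in [0, 1]; multiply by 100 for a percentage score.
    """
    frame = int(sr * frame_ms / 1000)
    n_frames = min(len(nasal), len(oral)) // frame
    ratios = []
    for i in range(n_frames):
        seg_n = nasal[i * frame:(i + 1) * frame]
        seg_o = oral[i * frame:(i + 1) * frame]
        e_n = float(np.sum(seg_n ** 2))
        e_o = float(np.sum(seg_o ** 2))
        if e_n + e_o > 0:                     # skip silent frames
            ratios.append(e_n / (e_n + e_o))
    return float(np.mean(ratios)) if ratios else 0.0
```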
Affiliation(s)
- Andrés Lozano
- Department of Communication Engineering, University of Málaga, Málaga, Spain
- Enrique Nava
- Department of Communication Engineering, University of Málaga, Málaga, Spain
2
Cornefjord M, Bluhme J, Jakobsson A, Klintö K, Lohmander A, Mamedov T, Stiernman M, Svensson R, Becker M. Using Artificial Intelligence for Assessment of Velopharyngeal Competence in Children Born With Cleft Palate With or Without Cleft Lip. Cleft Palate Craniofac J 2024:10556656241271646. PMID: 39150004; DOI: 10.1177/10556656241271646.
Abstract
OBJECTIVE: To develop an AI tool to assess velopharyngeal competence (VPC) in children with cleft palate, with or without cleft lip. DESIGN: Development of an AI tool using retrospective audio recordings and assessments of VPC. SETTING: Two datasets were used. The first, the SR dataset, included data from follow-up visits to Skåne University Hospital, Sweden. The second, the SC + IC dataset, combined data from the Scandcleft randomized trials across five countries with data from an intercenter study performed at six Swedish CL/P centers. PARTICIPANTS: The SR dataset included 153 recordings from 162 children, and the SC + IC dataset included 308 recordings from 399 children. All recordings were made at age 5 or 10 years, with corresponding VPC assessments. INTERVENTIONS: Development of two networks, a convolutional neural network (CNN) and a pre-trained CNN (VGGish). After initial testing on the SR dataset, the networks were re-tested on the SC + IC dataset and modified to improve performance. MAIN OUTCOME MEASURES: Accuracy of the networks' VPC scores, with speech-language pathologists' scores taken as the ground truth. A three-point scale was used for VPC assessments. RESULTS: VGGish outperformed the CNN, achieving 57.1% accuracy compared with 39.8%. Minor adjustments to data pre-processing and network characteristics improved accuracies. CONCLUSIONS: Network accuracies were too low for the networks to be useful alternatives for VPC assessment in clinical practice. Suggestions for future research with regard to study design and dataset optimization are discussed.
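The networks themselves are not described in detail here; as a rough illustration of the kind of three-class audio classifier compared in the study, the following PyTorch sketch maps a log-mel spectrogram to a three-point VPC score. All layer sizes, the 16 kHz sample rate, and the mel front-end settings are assumptions, and the snippet reproduces neither the study's CNN nor the VGGish model.

```python
import torch
import torch.nn as nn
import torchaudio

class VPCClassifier(nn.Module):
    """Small CNN mapping a log-mel spectrogram to one of three VPC classes."""
    def __init__(self, n_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> fixed-size embedding
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, mel_db: torch.Tensor) -> torch.Tensor:
        # mel_db: (batch, 1, n_mels, time)
        return self.head(self.features(mel_db).flatten(1))

# Waveform -> log-mel front end (settings are assumptions).
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()
wave = torch.randn(1, 16000)                          # 1 s of dummy audio
logits = VPCClassifier()(to_db(melspec(wave)).unsqueeze(1))
print(logits.shape)                                   # torch.Size([1, 3])
```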
Affiliation(s)
- Måns Cornefjord
- Department of Plastic and Reconstructive Surgery, Skåne University Hospital, Malmö, Sweden
- Department of Clinical Sciences in Malmö, Lund University, Malmö, Sweden
- Joel Bluhme
- Centre for Mathematical Sciences, Mathematical Statistics, Lund University, Lund, Sweden
- Andreas Jakobsson
- Centre for Mathematical Sciences, Mathematical Statistics, Lund University, Lund, Sweden
- Kristina Klintö
- Division of Speech and Language Pathology, Department of Otorhinolaryngology, Skåne University Hospital, Malmö, Sweden
- Division of Speech Language Pathology, Phoniatrics and Audiology, Department of Clinical Sciences in Lund, Lund University, Lund, Sweden
- Anette Lohmander
- Division of Speech & Language Pathology, Department of Clinical Science, Intervention and Technology, CLINTEC, Karolinska Institutet, Stockholm, Sweden
- Tofig Mamedov
- Centre for Mathematical Sciences, Mathematical Statistics, Lund University, Lund, Sweden
- Mia Stiernman
- Department of Plastic and Reconstructive Surgery, Skåne University Hospital, Malmö, Sweden
- Department of Clinical Sciences in Malmö, Lund University, Malmö, Sweden
- Rebecca Svensson
- Centre for Mathematical Sciences, Mathematical Statistics, Lund University, Lund, Sweden
- Magnus Becker
- Department of Plastic and Reconstructive Surgery, Skåne University Hospital, Malmö, Sweden
- Department of Clinical Sciences in Malmö, Lund University, Malmö, Sweden
3
Berisha V, Liss JM. Responsible development of clinical speech AI: Bridging the gap between clinical research and technology. NPJ Digit Med 2024;7:208. PMID: 39122889; PMCID: PMC11316053; DOI: 10.1038/s41746-024-01199-1.
Abstract
This perspective article explores the challenges and potential of using speech as a biomarker in clinical settings, particularly when constrained by the small clinical datasets typically available in such contexts. We contend that by integrating insights from speech science and clinical research, we can reduce the sample complexity of clinical speech AI models, with the potential to shorten timelines to translation. Most existing models are based on high-dimensional feature representations trained with limited sample sizes and often do not leverage insights from speech science and clinical research. This approach can lead to overfitting, where the models perform exceptionally well on training data but fail to generalize to new, unseen data. Additionally, without incorporating theoretical knowledge, these models may lack interpretability and robustness, making them challenging to troubleshoot or improve post-deployment. We propose a framework for organizing health conditions based on their impact on speech and promote the use of speech analytics in diverse clinical contexts beyond cross-sectional classification. For high-stakes clinical use cases, we advocate a focus on explainable and individually-validated measures and stress the importance of rigorous validation frameworks and ethical considerations for responsible deployment. Bridging the gap between AI research and clinical speech research presents new opportunities for more efficient translation of speech-based AI tools and for the advancement of scientific discoveries in this interdisciplinary space, particularly when only small or retrospective datasets are available.
Affiliation(s)
- Visar Berisha
- School of Electrical Computer and Energy Engineering and College of Health Solutions, Arizona State University, Tempe, AZ, USA.
- Julie M Liss
- College of Health Solutions, Arizona State University, Tempe, AZ, USA
4
Ha JH, Lee H, Kwon SM, Joo H, Lin G, Kim DY, Kim S, Hwang JY, Chung JH, Kong HJ. Deep Learning-Based Diagnostic System for Velopharyngeal Insufficiency Based on Videofluoroscopy in Patients With Repaired Cleft Palates. J Craniofac Surg 2023;34:2369-2375. PMID: 37815288; PMCID: PMC10597411; DOI: 10.1097/scs.0000000000009560.
Abstract
Velopharyngeal insufficiency (VPI), the incomplete closure of the velopharyngeal valve during speech, is a common poor outcome that should be evaluated after cleft palate repair. Interpretation of VPI based on both imaging analysis and perceptual evaluation is essential for further management. The authors retrospectively reviewed patients with repaired cleft palates who underwent assessment of velopharyngeal function, including both videofluoroscopic imaging and perceptual speech evaluation. The final diagnosis of VPI was made by plastic surgeons based on both assessment modalities. Deep learning techniques were applied to the diagnosis of VPI and compared with human experts' diagnoses from videofluoroscopic imaging. In addition, the results of the deep learning techniques were compared with a speech pathologist's perceptual evaluation to assess consistency with clinical symptoms. A total of 714 cases from January 2010 to June 2019 were reviewed. Six deep learning algorithms (VGGNet, ResNet, Xception, ResNeXt, DenseNet, and SENet) were trained on the obtained dataset. The area under the receiver operating characteristic curve of the algorithms ranged between 0.8758 and 0.9468 with the hold-out method and between 0.7992 and 0.8574 with 5-fold cross-validation. Our findings demonstrate that the deep learning algorithms performed comparably to experienced plastic surgeons in diagnosing VPI from videofluoroscopic velopharyngeal imaging.
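For readers unfamiliar with the two evaluation schemes mentioned above, the following sketch contrasts a hold-out AUC estimate with a 5-fold cross-validated one using scikit-learn. The logistic regression is only a stand-in for the image CNNs, and the random features and labels are placeholders, not the study's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(714, 32))           # placeholder features (714 cases, as in the study)
y = rng.integers(0, 2, size=714)         # placeholder VPI / non-VPI labels

clf = LogisticRegression(max_iter=1000)  # stand-in for the deep networks

# Hold-out estimate: single stratified train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
holdout_auc = roc_auc_score(y_te, clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# 5-fold cross-validated estimate: mean AUC over the folds.
cv_auc = cross_val_score(clf, X, y, scoring="roc_auc",
                         cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print(f"hold-out AUC = {holdout_auc:.3f}, 5-fold AUC = {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```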
Affiliation(s)
- Jeong Hyun Ha
- Department of Plastic and Reconstructive Surgery, Biomedical Research Institute, Seoul National University Hospital
- Interdisciplinary Program of Medical Informatics, Seoul National University College of Medicine, Seoul
- Haeyun Lee
- Department of Electrical Engineering and Computer Science, Daegu Gyeongbuk Institute of Science and Technology, Daegu
- Medical Big Data Research Center, Seoul National University College of Medicine, Seoul
- Production Engineering Research Team, SAMSUNG SDI, Yongin-si, Gyeonggi-do Province
- Seok Min Kwon
- Department of Plastic and Reconstructive Surgery, Seoul National University College of Medicine
- Hyunjin Joo
- Transdisciplinary Department of Medicine & Advanced Technology, Seoul National University Hospital, Seoul, Korea
- Guang Lin
- Department of Aesthetic and Plastic Surgery, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Deok-Yeol Kim
- Department of Plastic Surgery, CHA Bundang Medical Center, and CHA Institute of Aesthetic Medicine, Seongnam-si, Gyeonggi-do Province
- Sukwha Kim
- Medical Big Data Research Center, Seoul National University College of Medicine, Seoul
- Department of Plastic Surgery, CHA Bundang Medical Center, and CHA Institute of Aesthetic Medicine, Seongnam-si, Gyeonggi-do Province
- Jae Youn Hwang
- Department of Electrical Engineering and Computer Science, Daegu Gyeongbuk Institute of Science and Technology, Daegu
- Interdisciplinary Studies of Artificial Intelligence, Daegu Gyeongbuk Institute of Science and Technology, Daegu
- Jee-Hyeok Chung
- Division of Pediatric Plastic Surgery, Seoul National University Children’s Hospital
- Hyoun-Joong Kong
- Medical Big Data Research Center, Seoul National University College of Medicine, Seoul
- Department of Transdisciplinary Medicine, Seoul National University Hospital, Seoul, Republic of Korea
- Department of Medicine, Seoul National University College of Medicine, Seoul, Korea
5
Zhang Y, Zhang J, Li W, Yin H, He L. Automatic Detection System for Velopharyngeal Insufficiency Based on Acoustic Signals from Nasal and Oral Channels. Diagnostics (Basel) 2023;13:2714. PMID: 37627973; PMCID: PMC10453249; DOI: 10.3390/diagnostics13162714.
Abstract
Velopharyngeal insufficiency (VPI) is a type of velopharyngeal dysfunction that causes speech impairment and swallowing disorders. Speech therapists play a key role in the diagnosis and treatment of speech disorders; however, there is a worldwide shortage of experienced speech therapists. Artificial intelligence-based computer-aided diagnostic technology could be a solution. This paper proposes an automatic system for VPI detection at the subject level, offering a non-invasive and convenient approach to VPI diagnosis. Based on the characteristic articulation impairments of VPI patients, nasal- and oral-channel acoustic signals are collected as raw data. The system integrates the symptom discrimination results at the phoneme level. For consonants, relative prominent frequency description and relative frequency distribution features are proposed to discriminate nasal air emission caused by VPI. For hypernasality-sensitive vowels, a cross-attention residual Siamese network (CARS-Net) is proposed to perform automatic VPI/non-VPI classification at the phoneme level. CARS-Net embeds a cross-attention module between the two branches to improve the VPI/non-VPI classification model for vowels. We validate the proposed system on a self-built dataset, and the accuracy reaches 98.52%, demonstrating the feasibility of automatic VPI diagnosis.
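CARS-Net itself is not reproduced here; the sketch below only illustrates the general idea of a cross-attention block between two Siamese branch outputs, implemented with PyTorch's multi-head attention. The embedding size, pooling, and two-class head are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Let each branch attend to the other before fusing into VPI/non-VPI logits."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (batch, time, dim) feature sequences from the two Siamese branches
        a_att, _ = self.attn_ab(query=a, key=b, value=b)   # branch A attends to B
        b_att, _ = self.attn_ba(query=b, key=a, value=a)   # branch B attends to A
        fused = torch.cat([a_att.mean(dim=1), b_att.mean(dim=1)], dim=-1)
        return self.head(fused)

a = torch.randn(8, 20, 128)   # e.g. frame embeddings of a test vowel
b = torch.randn(8, 20, 128)   # e.g. frame embeddings of a reference vowel
print(CrossAttentionFusion()(a, b).shape)   # torch.Size([8, 2])
```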
Affiliation(s)
- Yu Zhang
- College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
- Jing Zhang
- College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
- Wen Li
- College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
- Heng Yin
- West China Hospital of Stomatology, Sichuan University, Chengdu 610041, China
- Ling He
- College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
6
Xu L, Liss J, Berisha V. Dysarthria detection based on a deep learning model with a clinically-interpretable layer. JASA Express Lett 2023;3:015201. PMID: 36725533; PMCID: PMC9835557; DOI: 10.1121/10.0016833.
Abstract
Studies have shown that deep neural networks (DNNs) are a potential tool for classifying dysarthric speakers and controls. However, the representations used to train DNNs are largely not clinically interpretable, which limits their clinical value. Here, a model with a bottleneck layer is trained to jointly learn a classification label and four clinically-interpretable features. Evaluation on two dysarthria subtypes shows that the proposed method can flexibly trade off between improved classification accuracy and discovery of clinically-interpretable deficit patterns. An analysis using Shapley additive explanations shows that the model learns a representation consistent with the disturbances that define the two dysarthria subtypes considered in this work.
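As a minimal sketch of the joint-learning idea described above (not the paper's architecture, input representation, or loss weighting), the model below forces predictions through a four-unit bottleneck that is supervised with clinician-rated features while a classification head reads off the diagnosis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckDysarthriaModel(nn.Module):
    """Shared bottleneck jointly predicting 4 interpretable features and a class label."""
    def __init__(self, in_dim: int = 256, n_features: int = 4, n_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))  # interpretable layer
        self.classifier = nn.Linear(n_features, n_classes)

    def forward(self, x: torch.Tensor):
        feats = self.encoder(x)              # predicted clinically-interpretable features
        return self.classifier(feats), feats

def joint_loss(logits, feats, y_cls, y_feats, alpha: float = 0.5):
    # alpha trades off classification accuracy against fidelity to the clinical features.
    return alpha * F.cross_entropy(logits, y_cls) + (1 - alpha) * F.mse_loss(feats, y_feats)
```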
Affiliation(s)
- Lingfeng Xu
- School of Computing and Augmented Intelligence, Arizona State University, Tempe, Arizona 85281, USA
- Julie Liss
- College of Health Solutions, Arizona State University, Tempe, Arizona 85281, USA
- Visar Berisha
- College of Health Solutions, Arizona State University, Tempe, Arizona 85281, USA
7
The Feasibility of Cross-Linguistic Speech Evaluation in the Care of International Cleft Palate Patients. J Craniofac Surg 2022;33:1413-1417. PMID: 35275855; DOI: 10.1097/scs.0000000000008645.
Abstract
Many patients with cleft palate in developing countries never receive postoperative speech assessment or therapy. The use of audiovisual recordings could improve access to post-repair speech care. The present study evaluated whether English-speaking speech-language pathologists (SLPs) could assess cleft palate patients speaking an unfamiliar language (Tamil) using recorded media. Recordings obtained from Tamil-speaking participants were rated by 1 Tamil-speaking SLP and 3 English-speaking SLPs. Ratings were analyzed for inter-rater reliability and scored for percent correct. Accuracy of the English-speaking SLPs was compared using independent t-tests and analysis of variance. Sixteen participants (mean age 14.5 years, standard deviation [SD] 7.4 years; mean age at surgery 2.7 years, SD 3.7 years; mean time since surgery 10.8 years, SD 5.7 years) were evaluated. Across the 4 SLPs, 5 speech elements were found to have moderate agreement, and the mean kappa was 0.145 (slight agreement). Amongst the English-speaking SLPs, 10 speech elements were found to have substantial or moderate agreement, and the mean kappa was 0.333 (fair agreement). The speech measures with the highest inter-rater reliability were hypernasality and consonant production errors. The average percent correct for the English-speaking SLPs was 60.7% (SD 20.2%). English-speaking SLPs were more accurate if the participant was female, under 18 years old, bilingual, or had received speech therapy. The results demonstrate that English-speaking SLPs without training in a specific language (here, Tamil) have limited ability to assess speech elements accurately. This research could guide training interventions to augment the ability of SLPs to conduct cross-linguistic evaluations and improve international cleft care delivered by global health teams.
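The exact kappa statistic used above is not specified further here; as a simple illustration of one way to summarize agreement among several raters, the sketch below averages pairwise Cohen's kappa over dummy ratings (the rater names and scores are hypothetical).

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings of one speech element (0/1/2) by four SLPs for 16 participants.
rng = np.random.default_rng(0)
ratings = {f"SLP{i}": rng.integers(0, 3, size=16) for i in range(1, 5)}

# Mean pairwise Cohen's kappa as a rough multi-rater agreement summary.
kappas = [cohen_kappa_score(ratings[a], ratings[b]) for a, b in combinations(ratings, 2)]
print(f"mean pairwise kappa = {np.mean(kappas):.3f}")
```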