1. Chen J, Chen X, Wang R, Le C, Khalilian-Gourtani A, Jensen E, Dugan P, Doyle W, Devinsky O, Friedman D, Flinker A, Wang Y. Subject-Agnostic Transformer-Based Neural Speech Decoding from Surface and Depth Electrode Signals. bioRxiv 2024:2024.03.11.584533. PMID: 38559163; PMCID: PMC10980022; DOI: 10.1101/2024.03.11.584533.
Abstract
Objective This study investigates speech decoding from neural signals captured by intracranial electrodes. Most prior work handles only electrodes arranged on a 2D grid (i.e., an electrocorticographic or ECoG array) and data from a single patient. We aim to design a deep-learning model architecture that can accommodate both surface (ECoG) and depth (stereotactic EEG, or sEEG) electrodes. The architecture should allow training on data from multiple participants with large variability in electrode placements, and the trained model should perform well on participants unseen during training. Approach We propose a novel transformer-based model architecture, SwinTW, that can work with arbitrarily positioned electrodes by leveraging their 3D locations on the cortex rather than their positions on a 2D grid. We train subject-specific models using data from a single participant, as well as multi-patient models that exploit data from multiple participants. Main Results The subject-specific models using only low-density 8×8 ECoG data achieved a high Pearson correlation coefficient (PCC) between the decoded and ground-truth spectrograms (PCC = 0.817) across N = 43 participants, outperforming our prior convolutional ResNet model and the 3D Swin transformer model. Incorporating the additional strip, depth, and grid electrodes available in each participant (N = 39) led to further improvement (PCC = 0.838). For participants with only sEEG electrodes (N = 9), subject-specific models still achieved comparable performance, with an average PCC = 0.798. The multi-subject models achieved high performance on unseen participants, with an average PCC = 0.765 in leave-one-out cross-validation. Significance The proposed SwinTW decoder enables future speech neuroprostheses to utilize any electrode placement that is clinically optimal or feasible for a particular participant, including the use of depth electrodes alone, which are more routinely implanted in chronic neurosurgical procedures. Importantly, the generalizability of the multi-patient models suggests that such a model can be applied to new patients who do not have paired acoustic and neural data, providing an advance in neuroprostheses for people with speech disabilities, for whom collecting acoustic-neural training data is not feasible.
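
The headline metric in this abstract is the Pearson correlation between decoded and ground-truth spectrograms. A minimal sketch of how such a score can be computed is below; the function name, array shapes, and random stand-in data are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def spectrogram_pcc(pred: np.ndarray, target: np.ndarray) -> float:
    """Pearson correlation coefficient between a decoded and a
    ground-truth spectrogram, computed over all time-frequency bins."""
    p = pred.ravel() - pred.mean()
    t = target.ravel() - target.mean()
    return float(np.dot(p, t) / (np.linalg.norm(p) * np.linalg.norm(t)))

# Random data standing in for model output and ground truth.
rng = np.random.default_rng(0)
target = rng.standard_normal((100, 128))               # (time, freq_bins)
pred = target + 0.5 * rng.standard_normal((100, 128))  # noisy "decoding"
print(f"PCC = {spectrogram_pcc(pred, target):.3f}")
```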

2. Maskeliūnas R, Damaševičius R, Kulikajevas A, Pribuišis K, Uloza V. Alaryngeal Speech Enhancement for Noisy Environments Using a Pareto Denoising Gated LSTM. J Voice 2024:S0892-1997(24)00228-5. PMID: 39107213; DOI: 10.1016/j.jvoice.2024.07.016.
Abstract
Loss of the larynx significantly alters natural voice production, requiring alternative communication modalities and rehabilitation methods to restore speech intelligibility and improve the quality of life of affected individuals. This paper explores advances in alaryngeal speech enhancement to improve signal quality and reduce background noise, focusing on individuals who have undergone laryngectomy. In this study, speech samples were obtained from 23 Lithuanian males who had undergone laryngectomy with secondary implantation of a tracheoesophageal prosthesis (TEP). A Pareto-optimized gated long short-term memory (LSTM) network was trained on tracheoesophageal speech data to capture complex temporal dependencies and contextual information in the speech signals. The system was able to distinguish actual speech from various forms of noise and artifacts, resulting in a 25% drop in the mean signal-to-noise ratio compared with other approaches. Acoustic analysis showed that the system significantly decreased the proportion of unvoiced frames from 40% to 10% while maintaining a stable proportion of voiced speech frames and stable average voicing evidence in voiced frames, indicating that the approach selectively attenuates noise and undesired speech artifacts while preserving important speech information.
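
The core model named in this abstract is a gated LSTM that separates speech from noise. Below is a minimal PyTorch sketch of one common formulation, an LSTM that predicts a sigmoid mask over spectrogram bins; the layer sizes, masking objective, and all hyperparameters are illustrative assumptions, not the Pareto-optimized configuration from the paper.

```python
import torch
import torch.nn as nn

class GatedLSTMDenoiser(nn.Module):
    """Sketch of an LSTM speech denoiser: predict a per-bin sigmoid
    gate and apply it to noisy magnitude-spectrogram frames."""

    def __init__(self, n_freq: int = 257, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        # noisy: (batch, time, n_freq) magnitude spectrogram
        h, _ = self.lstm(noisy)
        mask = self.gate(h)   # gate in [0, 1] per time-frequency bin
        return mask * noisy   # attenuate noise-dominated bins

model = GatedLSTMDenoiser()
frames = torch.rand(4, 100, 257)  # dummy batch of noisy spectrograms
print(model(frames).shape)        # torch.Size([4, 100, 257])
```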
Affiliation(s)
- Rytis Maskeliūnas: Centre of Real Time Computer Systems, Kaunas University of Technology, Kaunas, Lithuania
- Robertas Damaševičius: Centre of Real Time Computer Systems, Kaunas University of Technology, Kaunas, Lithuania
- Audrius Kulikajevas: Centre of Real Time Computer Systems, Kaunas University of Technology, Kaunas, Lithuania
- Kipras Pribuišis: Department of Otolaryngology, Lithuanian University of Health Sciences, Kaunas, Lithuania
- Virgilijus Uloza: Department of Otolaryngology, Lithuanian University of Health Sciences, Kaunas, Lithuania

3. Wang R, Chen ZS. Large-scale foundation models and generative AI for BigData neuroscience. Neurosci Res 2024:S0168-0102(24)00075-0. PMID: 38897235; DOI: 10.1016/j.neures.2024.06.003.
Abstract
Recent advances in machine learning have led to revolutionary breakthroughs in computer games, image and natural language understanding, and scientific discovery. Foundation models and large language models (LLMs) have recently achieved human-like intelligence thanks to BigData. With the help of self-supervised learning (SSL) and transfer learning, these models may potentially reshape the landscape of neuroscience research and make a significant impact on the future. Here we present a mini-review of recent advances in foundation models and generative AI models as well as their applications in neuroscience, including natural language and speech, semantic memory, brain-machine interfaces (BMIs), and data augmentation. We argue that this paradigm-shifting framework will open new avenues for many neuroscience research directions, and we discuss the accompanying challenges and opportunities.
Affiliation(s)
- Ran Wang: Department of Psychiatry, New York University Grossman School of Medicine, New York, NY 10016, USA
- Zhe Sage Chen: Department of Psychiatry, New York University Grossman School of Medicine, New York, NY 10016, USA; Department of Neuroscience and Physiology, Neuroscience Institute, New York University Grossman School of Medicine, New York, NY 10016, USA; Department of Biomedical Engineering, New York University Tandon School of Engineering, Brooklyn, NY 11201, USA

4. Wu H, Cai C, Ming W, Chen W, Zhu Z, Feng C, Jiang H, Zheng Z, Sawan M, Wang T, Zhu J. Speech decoding using cortical and subcortical electrophysiological signals. Front Neurosci 2024; 18:1345308. PMID: 38486966; PMCID: PMC10937352; DOI: 10.3389/fnins.2024.1345308.
Abstract
Introduction Language impairments often result from severe neurological disorders, driving the development of neural prosthetics that utilize electrophysiological signals to restore comprehensible language. Previous decoding efforts focused primarily on signals from the cerebral cortex, neglecting the potential contributions of subcortical brain structures to speech decoding in brain-computer interfaces. Methods In this study, stereotactic electroencephalography (sEEG) was employed to investigate the role of subcortical structures in speech decoding. Two native Mandarin Chinese speakers undergoing sEEG implantation for epilepsy treatment participated. Participants read Chinese text, and the power of the sEEG signals in the 1-30, 30-70, and 70-150 Hz frequency bands was extracted as the key feature set. A deep learning model based on long short-term memory assessed the contribution of different brain structures to speech decoding, predicting consonant articulatory place, articulatory manner, and tone within single syllables. Results Cortical signals excelled at articulatory place prediction (86.5% accuracy), while cortical and subcortical signals performed similarly for articulatory manner (51.5% vs. 51.7% accuracy). Subcortical signals provided superior tone prediction (58.3% accuracy). The superior temporal gyrus was consistently relevant in speech decoding for both consonants and tone. Combining cortical and subcortical inputs yielded the highest prediction accuracy, especially for tone. Discussion This study underscores the essential roles of both cortical and subcortical structures in different aspects of speech decoding.
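
The feature-extraction step in the Methods, per-band power of the sEEG signal in the 1-30, 30-70, and 70-150 Hz ranges, can be sketched as follows; the Butterworth filter design, filter order, and the use of band-passed signal variance as the power estimate are assumptions for illustration, not the authors' exact pipeline.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

BANDS = [(1, 30), (30, 70), (70, 150)]  # Hz, as in the study

def band_powers(seeg: np.ndarray, fs: float = 1000.0) -> np.ndarray:
    """Per-channel power in each frequency band.
    seeg: (channels, samples); returns (channels, len(BANDS))."""
    feats = []
    for lo, hi in BANDS:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        filtered = sosfiltfilt(sos, seeg, axis=-1)
        feats.append(filtered.var(axis=-1))  # variance ~ band power
    return np.stack(feats, axis=-1)

rng = np.random.default_rng(1)
x = rng.standard_normal((64, 2000))  # 64 channels, 2 s at 1 kHz
print(band_powers(x).shape)          # (64, 3)
```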
Affiliation(s)
- Hemmings Wu: Department of Neurosurgery, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Clinical Research Center for Neurological Disease of Zhejiang Province, Hangzhou, China
- Chengwei Cai: Department of Neurosurgery, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Wenjie Ming: Department of Neurosurgery, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Department of Neurology, Epilepsy Center, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Wangyu Chen: Department of Neurosurgery, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Zhoule Zhu: Department of Neurosurgery, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Chen Feng: Department of Neurosurgery, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Hongjie Jiang: Department of Neurosurgery, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Zhe Zheng: Department of Neurosurgery, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Mohamad Sawan: CenBRAIN Lab, School of Engineering, Westlake University, Hangzhou, China
- Ting Wang: School of Foreign Languages, Tongji University, Shanghai, China; Center for Speech and Language Processing, Tongji University, Shanghai, China
- Junming Zhu: Department of Neurosurgery, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China

5. He Q, Yang Y, Ge P, Li S, Chai X, Luo Z, Zhao J. The brain nebula: minimally invasive brain-computer interface by endovascular neural recording and stimulation. J Neurointerv Surg 2024:jnis-2023-021296. PMID: 38388478; DOI: 10.1136/jnis-2023-021296.
Abstract
A brain-computer interface (BCI) serves as a direct communication channel between brain activity and external devices, typically a computer or robotic limb. Advances in technology have led to the increasing use of intracranial electrical recording or stimulation in the treatment of conditions such as epilepsy, depression, and movement disorders. This indicates that BCIs can offer clinical neurological rehabilitation for patients with disabilities and functional impairments. They also provide a means to restore consciousness and functionality for patients with sequelae from major brain diseases. Whether invasive or non-invasive, the collected cortical or deep signals can be decoded and translated for communication. This review aims to provide an overview of the advantages of endovascular BCIs compared with conventional BCIs, along with insights into the specific anatomical regions under study. Given the rapid progress, we also provide updates on ongoing clinical trials and the prospects for current research involving endovascular electrodes.
Affiliation(s)
- Qiheng He: Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China; Brain Computer Interface Transitional Research Center, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Yi Yang: Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China; Brain Computer Interface Transitional Research Center, Beijing Tiantan Hospital, Capital Medical University, Beijing, China; China National Center for Neurological Disorders, Beijing, China; China National Clinical Research Center for Neurological Diseases, Beijing, China; National Research Center for Rehabilitation Technical Aids, Beijing, China; Chinese Institute for Brain Research, Beijing, China; Beijing Institute of Brain Disorders, Beijing, China
- Peicong Ge: Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Sining Li: Tianjin Key Laboratory of Brain Science and Intelligent Rehabilitation, College of Artificial Intelligence, Nankai University, Tianjin, China
- Xiaoke Chai: Brain Computer Interface Transitional Research Center, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
- Zhongqiu Luo: Department of Neurosurgery, Shenzhen Qianhai Shekou Free Trade Zone Hospital, Shenzhen, China
- Jizong Zhao: Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing, China; China National Center for Neurological Disorders, Beijing, China; China National Clinical Research Center for Neurological Diseases, Beijing, China

6. Khanna AR, Muñoz W, Kim YJ, Kfir Y, Paulk AC, Jamali M, Cai J, Mustroph ML, Caprara I, Hardstone R, Mejdell M, Meszéna D, Zuckerman A, Schweitzer J, Cash S, Williams ZM. Single-neuronal elements of speech production in humans. Nature 2024; 626:603-610. PMID: 38297120; PMCID: PMC10866697; DOI: 10.1038/s41586-023-06982-w.
Abstract
Humans are capable of generating extraordinarily diverse articulatory movement combinations to produce meaningful speech. This ability to orchestrate specific phonetic sequences, together with their syllabification and inflection over subsecond timescales, allows us to produce thousands of word sounds and is a core component of language [1,2]. The fundamental cellular units and constructs by which we plan and produce words during speech, however, remain largely unknown. Here, using acute ultrahigh-density Neuropixels recordings capable of sampling across the cortical column in humans, we discover neurons in the language-dominant prefrontal cortex that encoded detailed information about the phonetic arrangement and composition of planned words during the production of natural speech. These neurons represented the specific order and structure of articulatory events before utterance and reflected the segmentation of phonetic sequences into distinct syllables. They also accurately predicted the phonetic, syllabic and morphological components of upcoming words and showed a temporally ordered dynamic. Collectively, we show how these mixtures of cells are broadly organized along the cortical column and how their activity patterns transition from articulation planning to production. We also demonstrate how these cells reliably track the detailed composition of consonant and vowel sounds during perception and how they distinguish processes specifically related to speaking from those related to listening. Together, these findings reveal a remarkably structured organization and encoding cascade of phonetic representations by prefrontal neurons in humans and demonstrate a cellular process that can support the production of speech.
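
The report that these neurons "accurately predicted the phonetic, syllabic and morphological components of upcoming words" implies a decoding analysis over single-unit activity. The toy sketch below shows the general shape of such an analysis, a cross-validated classifier on trial-by-neuron spike counts; the synthetic data, classifier choice, and class structure are entirely illustrative and are not the authors' method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for pre-utterance spike counts: trials x neurons.
rng = np.random.default_rng(2)
n_trials, n_neurons = 200, 50
labels = rng.integers(0, 4, size=n_trials)  # e.g., 4 phonetic classes
counts = rng.poisson(5.0, size=(n_trials, n_neurons)).astype(float)
counts += 0.5 * labels[:, None]             # inject a weak class signal

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, counts, labels, cv=5).mean()
print(f"cross-validated decoding accuracy: {acc:.2f}")  # chance = 0.25
```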
Affiliation(s)
- Arjun R Khanna: Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- William Muñoz: Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Yoav Kfir: Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Angelique C Paulk: Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Mohsen Jamali: Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Jing Cai: Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Martina L Mustroph: Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Irene Caprara: Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Richard Hardstone: Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Mackenna Mejdell: Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Domokos Meszéna: Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Jeffrey Schweitzer: Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Sydney Cash: Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Ziv M Williams: Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; Harvard-MIT Division of Health Sciences and Technology, Boston, MA, USA; Harvard Medical School, Program in Neuroscience, Boston, MA, USA

7. Tsunada J, Eliades SJ. Frontal-Auditory Cortical Interactions and Sensory Prediction During Vocal Production in Marmoset Monkeys. bioRxiv 2024:2024.01.28.577656. PMID: 38352422; PMCID: PMC10862695; DOI: 10.1101/2024.01.28.577656.
Abstract
The control of speech and vocal production involves the calculation of error between the intended vocal output and the resulting auditory feedback. Consistent with this model, recent evidence has demonstrated that the auditory cortex is suppressed immediately before and during vocal production, yet is still sensitive to differences between vocal output and altered auditory feedback. This suppression has been suggested to be the result of top-down signals containing information about the intended vocal output, potentially originating from motor or other frontal cortical areas. However, whether such frontal areas are the source of suppressive and predictive signaling to the auditory cortex during vocalization is unknown. Here, we simultaneously recorded neural activity from both the auditory and frontal cortices of marmoset monkeys while they produced self-initiated vocalizations. We found increases in neural activity in both brain areas preceding the onset of vocal production, notably changes in both multi-unit activity and local field potential theta-band power. Connectivity analysis using Granger causality demonstrated that frontal cortex sends directed signaling to the auditory cortex during this pre-vocal period. Importantly, this pre-vocal activity predicted both vocalization-induced suppression of the auditory cortex as well as the acoustics of subsequent vocalizations. These results suggest that frontal cortical areas communicate with the auditory cortex preceding vocal production, with frontal-auditory signals that may reflect the transmission of sensory prediction information. This interaction between frontal and auditory cortices may contribute to mechanisms that calculate errors between intended and actual vocal outputs during vocal communication.
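
The directed frontal-to-auditory signaling reported here was assessed with Granger causality. A minimal sketch of such a test on two synthetic series is below; the synthetic data, preprocessing, and model order are assumptions, and the study's analysis was run on recorded neural signals rather than this toy example.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Synthetic stand-ins for band-limited LFP power: frontal leads auditory.
rng = np.random.default_rng(3)
n = 2000
frontal = rng.standard_normal(n)
auditory = 0.6 * np.roll(frontal, 5) + 0.4 * rng.standard_normal(n)

# Columns are [effect, cause]: test whether the frontal series helps
# predict the auditory series beyond the auditory series' own past.
data = np.column_stack([auditory, frontal])
results = grangercausalitytests(data, maxlag=10, verbose=False)
p_value = results[10][0]["ssr_ftest"][1]  # F-test p-value at lag 10
print(f"p-value (frontal -> auditory, lag 10): {p_value:.3g}")
```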
Affiliation(s)
- Joji Tsunada: Chinese Institute for Brain Research, Beijing, China; Department of Veterinary Medicine, Faculty of Agriculture, Iwate University, Morioka, Iwate, Japan
- Steven J. Eliades: Department of Head and Neck Surgery & Communication Sciences, Duke University School of Medicine, Durham, NC 27710, USA