1
Puffay C, Vanthornhout J, Gillis M, De Clercq P, Accou B, Van Hamme H, Francart T. Classifying coherent versus nonsense speech perception from EEG using linguistic speech features. Sci Rep 2024; 14:18922. PMID: 39143297; PMCID: PMC11324895; DOI: 10.1038/s41598-024-69568-0.
Abstract
When a person listens to natural speech, the relation between features of the speech signal and the corresponding evoked electroencephalogram (EEG) is indicative of neural processing of the speech signal. Using linguistic representations of speech, we investigate differences in neural processing between speech in a native language and speech in a foreign language that is not understood. We conducted experiments using three stimuli: a comprehensible language, an incomprehensible language, and randomly shuffled words from a comprehensible language, while recording the EEG of native Dutch-speaking participants. We modeled the neural tracking of linguistic features of the speech signals using a deep learning model in a match-mismatch task that relates EEG signals to speech, while accounting for lexical segmentation features reflecting acoustic processing. The deep learning model effectively classifies coherent versus nonsense languages. We also observed significant differences in tracking patterns between comprehensible and incomprehensible speech stimuli within the same language. This demonstrates the potential of deep learning frameworks for objectively measuring speech understanding.
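As a rough illustration of the match-mismatch paradigm described in this abstract, the sketch below (plain NumPy, not the authors' deep learning model) pairs each EEG segment with a time-aligned ("matched") and a misaligned ("mismatched") speech feature and scores accuracy as the fraction of segments for which the matched feature receives the higher similarity; the linear projection used as a similarity stand-in and all dimensions are illustrative assumptions.

```python
# Minimal sketch of a match-mismatch evaluation (not the authors' model).
import numpy as np

rng = np.random.default_rng(0)

def similarity(eeg, feature, weights):
    """Toy similarity: correlate a linear projection of the EEG with the speech feature."""
    projection = eeg @ weights                      # (time,) after mixing channels
    return np.corrcoef(projection, feature)[0, 1]

def match_mismatch_accuracy(eeg_segments, matched, mismatched, weights):
    correct = 0
    for eeg, m, mm in zip(eeg_segments, matched, mismatched):
        correct += similarity(eeg, m, weights) > similarity(eeg, mm, weights)
    return correct / len(eeg_segments)

# Synthetic example: 64-channel EEG, 5 s windows at 64 Hz, 200 decision windows.
n_seg, n_time, n_chan = 200, 320, 64
weights = rng.standard_normal(n_chan)
eeg = rng.standard_normal((n_seg, n_time, n_chan))
matched = rng.standard_normal((n_seg, n_time))
mismatched = rng.standard_normal((n_seg, n_time))
print("accuracy (chance ~0.5 on noise):",
      match_mismatch_accuracy(eeg, matched, mismatched, weights))
```

Accuracy above chance on real data would indicate neural tracking of the tested feature.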
Affiliation(s)
- Corentin Puffay
- Department Neurosciences, KU Leuven, ExpORL, Leuven, Belgium.
- Department of Electrical engineering (ESAT), KU Leuven, PSI, Leuven, Belgium.
- Marlies Gillis
- Department Neurosciences, KU Leuven, ExpORL, Leuven, Belgium
- Bernd Accou
- Department Neurosciences, KU Leuven, ExpORL, Leuven, Belgium
- Department of Electrical engineering (ESAT), KU Leuven, PSI, Leuven, Belgium
- Hugo Van Hamme
- Department of Electrical engineering (ESAT), KU Leuven, PSI, Leuven, Belgium
- Tom Francart
- Department Neurosciences, KU Leuven, ExpORL, Leuven, Belgium.
2
Roebben A, Heintz N, Geirnaert S, Francart T, Bertrand A. 'Are you even listening?' - EEG-based decoding of absolute auditory attention to natural speech. J Neural Eng 2024; 21:036046. PMID: 38834062; DOI: 10.1088/1741-2552/ad5403.
Abstract
Objective. In this study, we use electroencephalography (EEG) recordings to determine whether a subject is actively listening to a presented speech stimulus. More precisely, we aim to discriminate between an active listening condition and a distractor condition in which subjects focus on an unrelated distractor task while being exposed to a speech stimulus. We refer to this task as absolute auditory attention decoding. Approach. We re-use an existing EEG dataset in which the subjects watch a silent movie as a distractor condition, and introduce a new dataset with two distractor conditions (silently reading a text and performing arithmetic exercises). We focus on two EEG features, namely neural envelope tracking (NET) and spectral entropy (SE). Additionally, we investigate whether the detection of such an active listening condition can be combined with a selective auditory attention decoding (sAAD) task, where the goal is to decide to which of multiple competing speakers the subject is attending. The latter is a key task in so-called neuro-steered hearing devices that aim to suppress unattended audio while preserving the attended speaker. Main results. Contrary to a previous hypothesis that higher SE is related to active rather than passive listening (without any distractors), we find significantly lower SE in the active listening condition compared to the distractor conditions. Nevertheless, the NET is consistently and significantly higher when actively listening. Similarly, we show that the accuracy of an sAAD task improves when evaluating the accuracy only on the highest-NET segments, whereas the reverse is observed when evaluating the accuracy only on the lowest-SE segments. Significance. We conclude that the NET is more reliable for decoding absolute auditory attention, as it is consistently higher when actively listening, whereas the relation of the SE between active and passive listening seems to depend on the nature of the distractor.
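One of the two EEG features compared in this study, spectral entropy, can be sketched as the Shannon entropy of the normalized power spectrum of an EEG segment; the frequency band, window settings, and normalization below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of spectral entropy (SE) for an EEG segment, computed per channel and averaged.
import numpy as np
from scipy.signal import welch

def spectral_entropy(eeg, fs, band=(1.0, 40.0)):
    """eeg: (channels, samples). Returns mean normalized spectral entropy across channels."""
    freqs, psd = welch(eeg, fs=fs, nperseg=int(2 * fs), axis=-1)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    p = psd[:, mask]
    p = p / p.sum(axis=-1, keepdims=True)            # normalize to a distribution per channel
    entropy = -(p * np.log2(p + 1e-12)).sum(axis=-1)  # Shannon entropy in bits
    return entropy.mean() / np.log2(mask.sum())       # scale by the maximum possible entropy

fs = 128
segment = np.random.default_rng(1).standard_normal((64, 10 * fs))  # 64 channels, 10 s of noise
print("normalized spectral entropy:", spectral_entropy(segment, fs))
```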
Affiliation(s)
- Arnout Roebben
- KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium
- Nicolas Heintz
- KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium
- KU Leuven, Department of Neurosciences, Experimental Oto-Rhino-Laryngology (ExpORL), Leuven, Belgium
- Leuven.AI-KU Leuven institute for AI, Leuven, Belgium
- Simon Geirnaert
- KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium
- KU Leuven, Department of Neurosciences, Experimental Oto-Rhino-Laryngology (ExpORL), Leuven, Belgium
- Leuven.AI-KU Leuven institute for AI, Leuven, Belgium
- Tom Francart
- KU Leuven, Department of Neurosciences, Experimental Oto-Rhino-Laryngology (ExpORL), Leuven, Belgium
- Leuven.AI-KU Leuven institute for AI, Leuven, Belgium
- Alexander Bertrand
- KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Leuven, Belgium
- Leuven.AI-KU Leuven institute for AI, Leuven, Belgium
3
Yao Y, Stebner A, Tuytelaars T, Geirnaert S, Bertrand A. Identifying temporal correlations between natural single-shot videos and EEG signals. J Neural Eng 2024; 21:016018. PMID: 38277701; DOI: 10.1088/1741-2552/ad2333.
Abstract
Objective. Electroencephalography (EEG) is a widely used technology for recording brain activity in brain-computer interface (BCI) research, where understanding the encoding-decoding relationship between stimuli and neural responses is a fundamental challenge. Recently, there has been growing interest in encoding-decoding natural stimuli in a single-trial setting, as opposed to the traditional BCI literature, where multi-trial presentations of synthetic stimuli are commonplace. While EEG responses to natural speech have been extensively studied, such stimulus-following EEG responses to natural video footage remain underexplored. Approach. We collect a new EEG dataset with subjects passively viewing a film clip and extract a few video features that have been found to be temporally correlated with EEG signals. However, our analysis reveals that these correlations are mainly driven by shot cuts in the video. To avoid the confounds related to shot cuts, we construct another EEG dataset with natural single-shot videos as stimuli and propose a new set of object-based features. Main results. We demonstrate that previous video features lack robustness in capturing the coupling with EEG signals in the absence of shot cuts, and that the proposed object-based features exhibit significantly higher correlations. Furthermore, we show that the correlations obtained with these proposed features are not dominantly driven by eye movements. Additionally, we quantitatively verify the superiority of the proposed features in a match-mismatch task. Finally, we evaluate to what extent these proposed features explain the variance in coherent stimulus responses across subjects. Significance. This work provides valuable insights into feature design for video-EEG analysis and paves the way for applications such as visual attention decoding.
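The core quantity in this line of work, a temporal correlation between a stimulus feature and multichannel EEG, can be illustrated with canonical correlation analysis as in the sketch below; the synthetic data and the single-component CCA are assumptions for illustration and do not reproduce the authors' pipeline or their object-based features.

```python
# Illustrative sketch: canonical correlation between a stimulus feature and multichannel EEG.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(2)
n_time, n_chan = 5000, 64
eeg = rng.standard_normal((n_time, n_chan))
video_feature = rng.standard_normal((n_time, 1))   # hypothetical object-based motion feature

cca = CCA(n_components=1)
eeg_scores, feature_scores = cca.fit_transform(eeg, video_feature)
r = np.corrcoef(eeg_scores[:, 0], feature_scores[:, 0])[0, 1]
print("canonical correlation between EEG and video feature:", r)
```

On real recordings, significance of such a correlation would be assessed against a null distribution, for example from time-shifted or mismatched feature segments.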
Affiliation(s)
- Yuanyuan Yao
- Department of Electrical Engineering, STADIUS, KU Leuven, Leuven, Belgium
- Axel Stebner
- Department of Electrical Engineering, PSI, KU Leuven, Leuven, Belgium
- Tinne Tuytelaars
- Department of Electrical Engineering, PSI, KU Leuven, Leuven, Belgium
- Simon Geirnaert
- Department of Electrical Engineering, STADIUS, Department of Neurosciences, ExpORL, KU Leuven, Leuven, Belgium
- Alexander Bertrand
- Department of Electrical Engineering, STADIUS, KU Leuven, Leuven, Belgium
4
Puffay C, Vanthornhout J, Gillis M, Accou B, Van Hamme H, Francart T. Robust neural tracking of linguistic speech representations using a convolutional neural network. J Neural Eng 2023; 20:046040. PMID: 37595606; DOI: 10.1088/1741-2552/acf1ce.
Abstract
Objective. When listening to continuous speech, populations of neurons in the brain track different features of the signal. Neural tracking can be measured by relating the electroencephalography (EEG) to the speech signal. Using linear models, recent studies have shown a significant contribution of linguistic features over and above acoustic neural tracking. However, linear models cannot model the nonlinear dynamics of the brain. To overcome this, we use a convolutional neural network (CNN) that relates EEG to linguistic features, uses phoneme or word onsets as a control, and has the capacity to model nonlinear relations. Approach. We integrate phoneme- and word-based linguistic features (phoneme surprisal, cohort entropy (CE), word surprisal (WS) and word frequency (WF)) into our nonlinear CNN model and investigate whether they carry additional information on top of lexical features (phoneme and word onsets). We then compare the performance of our nonlinear CNN with that of a linear encoder and a linearized CNN. Main results. For the nonlinear CNN, we found a significant contribution of CE over phoneme onsets and of WS and WF over word onsets. Moreover, the nonlinear CNN outperformed the linear baselines. Significance. Measuring the coding of linguistic features in the brain is important for auditory neuroscience research and for applications that involve objectively measuring speech understanding. With linear models this is measurable, but the effects are very small. The proposed nonlinear CNN model yields larger differences between linguistic and lexical models and could therefore reveal effects that would otherwise be unmeasurable, which may, in the future, lead to improved within-subject measures and shorter recordings.
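The word-level linguistic features named in this abstract are commonly encoded as impulse trains at word onsets whose amplitudes carry the feature value, with an onsets-only control of constant amplitude; the sketch below shows this representation with made-up onset times and surprisal values (it is not the authors' code).

```python
# Sketch of word-onset and word-surprisal feature representations for neural-tracking models.
import numpy as np

fs = 64                                              # feature/EEG sampling rate in Hz
duration_s = 10
word_onsets_s = [0.3, 0.9, 1.6, 2.4, 3.1, 4.0]       # hypothetical onset times (s)
word_surprisal = [4.2, 7.8, 2.1, 9.3, 5.5, 6.7]      # hypothetical -log p(word | context)

n_samples = duration_s * fs
onset_control = np.zeros(n_samples)                   # lexical control: onsets only
surprisal_feature = np.zeros(n_samples)                # linguistic feature: onset * surprisal
for t, s in zip(word_onsets_s, word_surprisal):
    idx = int(round(t * fs))
    onset_control[idx] = 1.0
    surprisal_feature[idx] = s

# A model that scores better with the surprisal feature than with the onset-only control
# is taken as evidence of linguistic (not merely lexical/acoustic) tracking.
print(onset_control.nonzero()[0], surprisal_feature[surprisal_feature > 0])
```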
Affiliation(s)
- Corentin Puffay
- Department Neurosciences, ExpORL, KU Leuven, Leuven, Belgium
- Department of Electrical engineering (ESAT), PSI, KU Leuven, Leuven, Belgium
- Marlies Gillis
- Department Neurosciences, ExpORL, KU Leuven, Leuven, Belgium
- Bernd Accou
- Department Neurosciences, ExpORL, KU Leuven, Leuven, Belgium
- Department of Electrical engineering (ESAT), PSI, KU Leuven, Leuven, Belgium
- Hugo Van Hamme
- Department of Electrical engineering (ESAT), PSI, KU Leuven, Leuven, Belgium
- Tom Francart
- Department Neurosciences, ExpORL, KU Leuven, Leuven, Belgium
5
Gillis M, Van Canneyt J, Francart T, Vanthornhout J. Neural tracking as a diagnostic tool to assess the auditory pathway. Hear Res 2022; 426:108607. PMID: 36137861; DOI: 10.1016/j.heares.2022.108607.
Abstract
When a person listens to sound, the brain time-locks to specific aspects of the sound. This is called neural tracking and it can be investigated by analysing neural responses (e.g., measured by electroencephalography) to continuous natural speech. Measures of neural tracking allow for an objective investigation of a range of auditory and linguistic processes in the brain during natural speech perception. This approach is more ecologically valid than traditional auditory evoked responses and has great potential for research and clinical applications. This article reviews the neural tracking framework and highlights three prominent examples of neural tracking analyses: neural tracking of the fundamental frequency of the voice (f0), the speech envelope and linguistic features. Each of these analyses provides a unique point of view into the human brain's hierarchical stages of speech processing. F0-tracking assesses the encoding of fine temporal information in the early stages of the auditory pathway, i.e., from the auditory periphery up to early processing in the primary auditory cortex. Envelope tracking reflects bottom-up and top-down speech-related processes in the auditory cortex and is likely necessary but not sufficient for speech intelligibility. Linguistic feature tracking (e.g. word or phoneme surprisal) relates to neural processes more directly related to speech intelligibility. Together these analyses form a multi-faceted objective assessment of an individual's auditory and linguistic processing.
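Envelope tracking, one of the three analyses reviewed here, is often quantified with a backward model that reconstructs the speech envelope from time-lagged EEG and correlates the reconstruction with the true envelope; the sketch below shows that idea on synthetic data, with the lag range and regularization chosen as illustrative assumptions rather than the review's prescriptions.

```python
# Minimal sketch of envelope tracking with a backward (decoder) model.
import numpy as np
from sklearn.linear_model import Ridge

def lag_matrix(eeg, max_lag):
    """Stack time-lagged copies of the EEG channels: (time, channels * (max_lag + 1))."""
    n_time, n_chan = eeg.shape
    lagged = np.zeros((n_time, n_chan * (max_lag + 1)))
    for lag in range(max_lag + 1):
        lagged[lag:, lag * n_chan:(lag + 1) * n_chan] = eeg[:n_time - lag]
    return lagged

fs = 64
rng = np.random.default_rng(3)
eeg = rng.standard_normal((fs * 60, 64))             # 1 minute of 64-channel EEG (noise here)
envelope = rng.standard_normal(fs * 60)              # speech envelope at the same rate

X = lag_matrix(eeg, max_lag=int(0.25 * fs))          # lags 0-250 ms
split = fs * 45                                       # train on 45 s, test on 15 s
decoder = Ridge(alpha=1.0).fit(X[:split], envelope[:split])
reconstruction = decoder.predict(X[split:])
print("envelope tracking (reconstruction correlation):",
      np.corrcoef(reconstruction, envelope[split:])[0, 1])
```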
Affiliation(s)
- Marlies Gillis
- Experimental Oto-Rhino-Laryngology, Department of Neurosciences, Leuven Brain Institute, KU Leuven, Belgium.
- Jana Van Canneyt
- Experimental Oto-Rhino-Laryngology, Department of Neurosciences, Leuven Brain Institute, KU Leuven, Belgium
- Tom Francart
- Experimental Oto-Rhino-Laryngology, Department of Neurosciences, Leuven Brain Institute, KU Leuven, Belgium
- Jonas Vanthornhout
- Experimental Oto-Rhino-Laryngology, Department of Neurosciences, Leuven Brain Institute, KU Leuven, Belgium
6
Modulation transfer functions for audiovisual speech. PLoS Comput Biol 2022; 18:e1010273. PMID: 35852989; PMCID: PMC9295967; DOI: 10.1371/journal.pcbi.1010273.
Abstract
Temporal synchrony between facial motion and acoustic modulations is a hallmark feature of audiovisual speech. The moving face and mouth during natural speech are known to be correlated with low-frequency acoustic envelope fluctuations (below 10 Hz), but the precise rates at which envelope information is synchronized with motion in different parts of the face are less clear. Here, we used regularized canonical correlation analysis (rCCA) to learn speech envelope filters whose outputs correlate with motion in different parts of the speaker's face. We leveraged recent advances in video-based 3D facial landmark estimation, allowing us to examine statistical envelope-face correlations across a large number of speakers (∼4000). Specifically, rCCA was used to learn modulation transfer functions (MTFs) for the speech envelope that significantly predict correlation with facial motion across different speakers. The AV analysis revealed bandpass speech envelope filters at distinct temporal scales. A first set of MTFs showed peaks around 3-4 Hz and were correlated with mouth movements. A second set of MTFs captured envelope fluctuations in the 1-2 Hz range correlated with more global face and head motion. These two distinctive timescales emerged only as a property of natural AV speech statistics across many speakers. A similar analysis of fewer speakers performing a controlled speech task highlighted only the well-known temporal modulations around 4 Hz correlated with orofacial motion. The different bandpass ranges of AV correlation align notably with the average rates at which syllables (3-4 Hz) and phrases (1-2 Hz) are produced in natural speech. Whereas periodicities at the syllable rate are evident in the envelope spectrum of the speech signal itself, slower 1-2 Hz regularities thus only become prominent when considering crossmodal signal statistics. This may indicate a motor origin of temporal regularities at the timescales of syllables and phrases in natural speech.
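The envelope-face coupling at specific modulation rates that rCCA uncovers here can be illustrated, in a much reduced form, by band-pass filtering the speech envelope at a candidate rate and correlating it with a facial-motion trace; the sketch below does this on synthetic signals, with the 3-4 Hz band and all signal parameters as illustrative assumptions rather than the study's learned filters.

```python
# Toy illustration of bandpass envelope-face correlation (not the paper's rCCA pipeline).
import numpy as np
from scipy.signal import butter, filtfilt

fs = 50                                               # video/feature rate in Hz
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(4)
mouth_motion = np.sin(2 * np.pi * 3.5 * t) + 0.5 * rng.standard_normal(t.size)
envelope = np.sin(2 * np.pi * 3.5 * t + 0.3) + 0.5 * rng.standard_normal(t.size)

b, a = butter(4, [3.0, 4.0], btype="bandpass", fs=fs)  # candidate 3-4 Hz modulation band
envelope_band = filtfilt(b, a, envelope)
print("envelope-mouth correlation in the 3-4 Hz band:",
      np.corrcoef(envelope_band, mouth_motion)[0, 1])
```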
7
Accou B, Jalilpour Monesi M, Van Hamme H, Francart T. Predicting speech intelligibility from EEG in a non-linear classification paradigm. J Neural Eng 2021; 18. PMID: 34706347; DOI: 10.1088/1741-2552/ac33e9.
Abstract
Objective. Currently, only behavioral speech understanding tests are available, which require active participation of the person being tested. As this is infeasible for certain populations, an objective measure of speech intelligibility is required. Recently, brain imaging data have been used to establish a relationship between stimulus and brain response. Linear models have been successfully linked to speech intelligibility but require per-subject training. We present a deep-learning-based model incorporating dilated convolutions that operates in a match/mismatch paradigm. The accuracy of the model's match/mismatch predictions can be used as a proxy for speech intelligibility without subject-specific (re)training. Approach. We evaluated the performance of the model as a function of input segment length, electroencephalography (EEG) frequency band and receptive field size, while comparing it to multiple baseline models. Next, we evaluated performance on held-out data and the effect of fine-tuning. Finally, we established a link between the accuracy of our model and the state-of-the-art behavioral MATRIX test. Main results. The dilated convolutional model significantly outperformed the baseline models for every input segment length, for all EEG frequency bands except the delta and theta bands, and for receptive field sizes between 250 and 500 ms. Additionally, fine-tuning significantly increased the accuracy on a held-out dataset. Finally, a significant correlation (r = 0.59, p = 0.0154) was found between the speech reception threshold (SRT) estimated using the behavioral MATRIX test and our objective method. Significance. Our method is the first to predict the SRT from EEG for unseen subjects, contributing to objective measures of speech intelligibility.
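The dilated convolutions at the heart of this model enlarge the receptive field over the EEG without pooling by stacking convolutional layers with exponentially increasing dilation; the sketch below shows that mechanism in PyTorch with an assumed layer count, kernel size, and channel width, and is not the published architecture.

```python
# Hedged sketch of a dilated convolutional EEG encoder (illustrative, not the paper's model).
import torch
import torch.nn as nn

class DilatedEEGEncoder(nn.Module):
    def __init__(self, n_channels=64, hidden=16, kernel_size=3, n_layers=3):
        super().__init__()
        layers = []
        in_ch = n_channels
        for i in range(n_layers):
            # Dilation grows as kernel_size**i, so the receptive field grows exponentially.
            layers += [nn.Conv1d(in_ch, hidden, kernel_size, dilation=kernel_size ** i),
                       nn.ReLU()]
            in_ch = hidden
        self.net = nn.Sequential(*layers)

    def forward(self, eeg):                           # eeg: (batch, channels, time)
        return self.net(eeg)

eeg = torch.randn(8, 64, 320)                         # 8 windows of 5 s EEG at 64 Hz
print("encoded shape:", DilatedEEGEncoder()(eeg).shape)
```

In a match/mismatch setup, such an encoder's output would be compared against encoded matched and mismatched speech segments, and the comparison accuracy used as the intelligibility proxy described above.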
Affiliation(s)
- Bernd Accou
- Department of Neuroscience and Department of Electrical Engineering, KU Leuven, Leuven, Vlaams Brabant, 3000, Belgium
- Mohammad Jalilpour Monesi
- Department of Neuroscience and Department of Electrical Engineering, KU Leuven, Leuven, Vlaams Brabant, 3000, Belgium
- Hugo Van Hamme
- Department of Electrical Engineering, KU Leuven, Leuven, Vlaams Brabant, 3000, Belgium
- Tom Francart
- Department of Neuroscience, KU Leuven, Leuven, Vlaams Brabant, 3000, Belgium