Discovery of a Scaling Law in Deep Learning Using EEG: A Path Towards Practical Non-Invasive Speech BMI
On July 11, 2024, the X Communication Team (Team Lead: Dr. Shuntaro Sasai) of the Research and Development Department at Araya Inc. posted the paper “Scaling Law in Neural Data: Non-Invasive Speech Decoding with 175 Hours of EEG Data” on arXiv. This article provides an explanation of that paper.
Brain-machine interfaces (BMIs) have been developed as assistive technology for people with physical and communication difficulties. Recent findings have shown that brain implants, i.e., electrodes implanted inside the skull, enable speech decoding with practical accuracy, attracting significant attention. However, brain implants require open-brain surgery, which is highly demanding both physically and psychologically and poses a substantial barrier to broad acceptance in society. In contrast, electroencephalography (EEG) is non-invasive (it causes no physical damage) and easy to put on and carry around, so EEG-based BMIs are highly anticipated. In practice, however, while simple tasks such as classification with a small number of classes are feasible, achieving practical accuracy on tasks with many degrees of freedom has been challenging, casting doubt on the feasibility of EEG-based speech BMI.
Recent EEG decoding studies have relied on deep learning, a technology well known for its use in large language models (LLMs) such as ChatGPT. In deep learning, accuracy is known to scale with three factors: the amount of training data, the model size, and the computational resources. This relationship, called the scaling law of deep learning, has driven the recent trend toward large-scale AI development and improvements across many fields. A typical EEG decoding dataset, however, contains only about 10 hours of recordings, far too little to benefit from data scaling. It has therefore remained unknown whether scaling laws hold for EEG and whether decoding accuracy could be improved by exploiting them.
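As a point of reference, the sketch below illustrates what such a scaling law looks like in practice: loss (or error) decreasing as a power law of dataset size, which appears as a straight line on log-log axes. The data points are purely hypothetical placeholders, not values from the paper.

```python
# Minimal power-law fit illustrating a scaling law: loss ~ a * hours^(-b).
# The (hours, loss) values below are hypothetical placeholders, not results
# reported in the paper.
import numpy as np

hours = np.array([10.0, 25.0, 50.0, 100.0, 175.0])  # hypothetical dataset sizes (h)
loss = np.array([5.8, 5.1, 4.6, 4.1, 3.8])          # hypothetical test losses

# Fit log(loss) = log(a) - b * log(hours) by least squares.
slope, intercept = np.polyfit(np.log(hours), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"fitted power law: loss ~ {a:.2f} * hours^(-{b:.3f})")

# Extrapolation (power laws may of course break down outside the fitted range).
for h in (500, 1000):
    print(f"predicted loss at {h} h: {a * h ** (-b):.2f}")
```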
One of the underlying challenges was the difficulty of creating large-scale EEG data in the first place. For instance, speech decoding requires continuously accumulating EEG data during monotonous tasks such as reading text aloud. Performing such tasks daily over a long period is unbearable for most subjects and difficult to sustain. To solve this, our X Communication Team of the Research and Development Department at Araya Inc. recruited a "BMI pilot," a participant who loves games and is willing to play them for long stretches of time. The team designed an environment in which the pilot could accumulate data effortlessly (or even enjoyably) by framing the task as play-by-play narration, in which the pilot read aloud the lines of characters in the game (Figure 1). As a result, the team succeeded in collecting more than 400 hours of EEG data during text reading from a single subject.
Figure 1 Sample image of the experiment
Using this large-scale data, the team trained a deep learning model through self-supervised representation learning for zero-shot speech phrase classification. A model trained on 100 hours of the collected data achieved 48% accuracy on a task requiring the correct phrase to be chosen from 512 options. In contrast, models trained with the amount of data typically used in EEG studies (approximately 10 hours) saw accuracy drop sharply from 48% to 2.5% (Figure 2), revealing a significant scaling effect of data volume in EEG-based deep learning.
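To make the evaluation concrete, here is a minimal sketch of how such zero-shot phrase classification can be scored by latent matching: an EEG segment is encoded into a latent vector and compared against the latents of the 512 candidate phrases, with the best-matching candidates taken as the prediction. The dimensions, random stand-in latents, and cosine-similarity scoring are assumptions for illustration, not the paper's exact architecture.

```python
# Zero-shot phrase classification by latent matching (illustrative sketch).
import torch
import torch.nn.functional as F

def zero_shot_classify(eeg_latent: torch.Tensor,
                       phrase_latents: torch.Tensor,
                       k: int = 10) -> torch.Tensor:
    """eeg_latent: (d,) latent of one EEG segment.
    phrase_latents: (n_candidates, d) latents of the candidate phrases.
    Returns the indices of the top-k best-matching candidates."""
    sims = F.cosine_similarity(eeg_latent.unsqueeze(0), phrase_latents, dim=-1)
    return sims.topk(k).indices

# Hypothetical usage with random latents standing in for real encoder outputs.
d = 256
eeg_latent = torch.randn(d)
phrase_latents = torch.randn(512, d)   # 512 candidate phrases
top10 = zero_shot_classify(eeg_latent, phrase_latents, k=10)
print("top-10 candidate phrase indices:", top10.tolist())
```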
Figure 2 Scaling effect by the amount of data
The power law between the data size (in hours) used to train the model and the top-1 accuracy (left), top-10 accuracy (center), and loss (right) on a phrase classification task using test data. The black dashed line indicates the chance level of classification accuracy, and the orange dashed line indicates the best linear fit to the data. The green and red arrows indicate the sizes of the data used in previous EEG language decoding studies (see references below).
Broderick, M. P., Anderson, A. J., Di Liberto, G. M., et al. Electrophysiological correlates of semantic dissimilarity reflect the comprehension of natural, narrative speech. Curr. Biol. 28(5):803–809.e3, 2018.
Brennan, J. R., Hale, J. T. Hierarchical structure guides rapid linguistic predictions during naturalistic listening. PLoS One 14(1):e0207741, 2019.
In addition, not only the content of phrases but also the breaks between spoken words can be decoded more precisely with more training data (Figure 3). This indicates that speech segments can be recognized in a data-driven manner without explicitly labeling individual words within phrases.
Figure 3 Speech interval detection using EEG latent representation
(a) The EEG latent representations obtained by the encoder (color map) reflect the state of the decoded speech. Intervals in which this latent representation varies substantially over time indicate dynamic changes in the speech state, whereas intervals with little variation indicate a stable speech state. Comparing these intervals with the subject's actual speech revealed that intervals with minimal variation corresponded to periods when the subject was not speaking; the latent representation during these intervals therefore indicates silence. A further comparison between the speaking intervals predicted from the EEG latent representation and the intervals identified from the actual speech recordings yielded a matching rate of 0.88. (b) This matching rate was also confirmed to improve as the amount of training data increased.
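The sketch below illustrates one simple way to implement the idea in Figure 3: frames where the latent trajectory changes little are labeled silence, frames where it changes a lot are labeled speech, and the predicted mask is compared against the true speech mask. The median threshold and the frame-wise agreement used here are simplifying assumptions, not the paper's exact procedure or metric.

```python
# Speech/silence detection from the variability of EEG latents (sketch).
import numpy as np

def detect_speech_frames(latents: np.ndarray) -> np.ndarray:
    """latents: (T, d) sequence of EEG latent vectors.
    Returns a boolean mask of length T: True where speech is predicted."""
    # Frame-to-frame change magnitude of the latent trajectory.
    delta = np.linalg.norm(np.diff(latents, axis=0), axis=1)
    delta = np.concatenate([[delta[0]], delta])   # pad to keep length T
    return delta > np.median(delta)               # low variation -> silence

def matching_rate(pred: np.ndarray, true: np.ndarray) -> float:
    """Fraction of frames where the predicted and true speech labels agree."""
    return float((pred == true).mean())

# Hypothetical usage with synthetic latents: larger latent steps during speech.
rng = np.random.default_rng(0)
T, d = 1000, 64
true_mask = rng.random(T) > 0.5
steps = rng.normal(size=(T, d)) * np.where(true_mask, 1.0, 0.1)[:, None]
latents = np.cumsum(steps, axis=0)
pred_mask = detect_speech_frames(latents)
print(f"matching rate: {matching_rate(pred_mask, true_mask):.2f}")
```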
When reading text aloud, electromyographic (EMG) signals from the facial muscles that move during speech can contaminate the EEG signals. If the EEG signals were used as-is to train the AI model, there would therefore be a risk of inadvertently building a model that decodes phrases from EMG rather than from EEG. To address this, the team deliberately created artificial data by adding the EMG signals recorded while uttering a different phrase B to the EEG signals recorded while uttering phrase A, and trained the model to identify phrase A as the correct answer when presented with such artificial data (Figure 4). With a model trained in this way, the team confirmed that accuracy dropped sharply when the input consisted only of EMG signals from the facial muscles during speech (Figure 4).
Figure 4 Training paradigm to mitigate the impact of electromyographic signals on decoding
(a) To train an AI model whose decoding is less influenced by EMG, data augmentation with synthetic mixtures of EMG and EEG was performed. The model was trained to treat the target phrase as the correct answer when the input was the EEG recorded while the target phrase was uttered, mixed with various proportions of EMG recorded during different phrases (a CLIP loss between the audio latent of the target phrase and the latent of the synthetic EEG). (b) When the model was not trained to suppress EMG and only EMG was used as the decoding input, the correct phrase was identified with up to 28.8% top-10 accuracy. This is lower than the accuracy of the EEG model (76.0%) but well above the 1.95% chance level. In contrast, the model trained with method (a), in which EMG measured near the upper lip (uEMG) was used to suppress the influence of EMG, achieved at most 3.05% accuracy when given EMG as input, regardless of where the EMG was recorded, while still decoding phrases with 70.1% accuracy when given EEG.
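A minimal sketch of this training scheme, assuming a CLIP-style contrastive objective as described above, might look as follows. The encoder, tensor shapes, and mixing range are placeholders rather than the paper's exact configuration.

```python
# EMG-robust training sketch: mix EMG from other phrases into the target EEG
# and tie the synthetic-EEG latent to the target phrase's audio latent.
import torch
import torch.nn.functional as F

def clip_loss(eeg_latents: torch.Tensor, audio_latents: torch.Tensor,
              temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE (CLIP-style) loss over a batch of paired latents."""
    eeg = F.normalize(eeg_latents, dim=-1)
    audio = F.normalize(audio_latents, dim=-1)
    logits = eeg @ audio.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(len(eeg))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def training_step(eeg_a, emg_b, audio_latent_a, eeg_encoder):
    """eeg_a: EEG recorded while uttering the target phrases A, shape (B, C, T).
    emg_b: EMG recorded while uttering *different* phrases B, shape (B, C, T).
    audio_latent_a: precomputed audio latents of phrases A, shape (B, d)."""
    alpha = torch.rand(eeg_a.shape[0], 1, 1)              # random mixing proportion
    synthetic_eeg = eeg_a + alpha * emg_b                 # EMG-contaminated EEG
    eeg_latent = eeg_encoder(synthetic_eeg)               # (B, d)
    return clip_loss(eeg_latent, audio_latent_a)

# Hypothetical usage with a dummy encoder standing in for the real EEG encoder.
if __name__ == "__main__":
    B, C, T, d = 8, 64, 512, 256
    encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(C * T, d))
    loss = training_step(torch.randn(B, C, T), torch.randn(B, C, T),
                         torch.randn(B, d), encoder)
    print("CLIP loss:", float(loss))
```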
The results of this study demonstrate the possibility of decoding speech from EEG by training AI models with large-scale data, and they represent an important step toward realizing non-invasive speech BMI. At the same time, we have not yet resolved all the problems caused by the various noise sources that arise during speech, and we will continue research to clarify their effects and to develop better methods for reducing them.
This achievement was made possible with EEG devices that require the head to be shaved. The team has already begun accumulating data with EEG devices that do not require shaving, with plans to build a large-scale dataset containing more than 10,000 hours of EEG recordings. Alongside this, the team is currently recruiting members to contribute to data collection. In particular, aiming to offer this technology as an option for individuals who face societal barriers for various reasons, including neurological conditions affecting their body or communication, the team is beginning to create an environment where stakeholders can collaborate. For more details, please visit the recruitment pages in the related links below.
Furthermore, the team anticipates that this research can be applied across various societal contexts. To achieve this, the team believes collaboration with stakeholders across diverse fields and industries is essential. If you are interested or wish to conduct an interview, please feel free to contact us through the contact form provided in the related links below.
*This research is being conducted with the support of the Japan Science and Technology Agency (JST) under the Moonshot Research and Development Program, Moonshot Goal 1 “Realization of a society in which human beings can be free from limitations of body, brain, space, and time by 2050.”
【Related Links】
X Communication Team: https://research.araya.org/research/x-communication-team
Moonshot Goal 1 Kanai Project Internet of Brains: https://brains.link/en
"Neu World" (Published SF works themed around the research of X Communication):https://neu-world.link/
【Actively Recruiting! Job Postings for the X Communication Team】※The recruitment pages are available in Japanese only.
BMI Researcher/Engineer: https://herp.careers/v1/arayainc/gnuu5n44hjks
BMI Technician: https://herp.careers/v1/arayainc/CORUrPtcgwaI
BMI Experiment Member "BMI Pilot": https://herp.careers/v1/arayainc/NlXz1yTDk72n
【Interviews and collaborations are welcome at any time!】
Please contact us from the contact page below.
Contact:https://www.araya.org/contact/