02 December 2021

Voice from the computer

The long road to a computer that reads our thoughts aloud

Adam Rogers, WIRED: The Long Search for a Computer That Speaks Your Mind

Translated by Alexander Gorlov, XX2 century

The trick is to synthesize speech in real time from brain signals, so that the machine can learn and the user can practice at the same time. Step by step, this work is producing a new class of brain-computer interface (BCI) systems.

Here is how a recent study was set up: a woman speaks Dutch into a microphone while sensors, eleven tiny rods of platinum and iridium, record the signals generated by her brain cells.

The 20-year-old volunteer has epilepsy, and doctors inserted the two-millimeter rods, each carrying between 8 and 18 electrodes, into the frontal and left regions of her brain in the hope of locating the seizure foci. For another team of scientists, though, this piece of neurological "microacupuncture" was a stroke of luck: the implanted electrodes happen to touch the parts of the brain responsible for turning thoughts into words and articulating speech.

What this second team is doing is fascinating. After the woman says something aloud (this is called "explicit speech") and the computer algorithmically links the sounds of her speech to her brain activity, the researchers ask her to repeat what she said. This time she barely whispers, mouthing the words almost soundlessly with her lips, tongue and jaw. This is "implicit (intended) speech." Then she repeats it all once more, now without any articulation at all: at the researchers' request, she simply imagines saying the words.

[Figure: schematic representation of the experiment.]

This reproduces an ordinary act of speech in reverse order. In real life we form silent ideas in one part of the brain, another part puts them into words, and still other parts control the movements of the mouth, tongue, lips and larynx, producing audible speech at suitable frequencies. Here the computer lets the woman's thoughts "skip the queue": it registers what she intends to say (in scientific terms, "imagined speech") and, interpreting the signals coming from her brain in real time, immediately turns them into sound and plays it back. So far those sounds do not resemble intelligible words. The work, whose results were published at the end of September (Angrick et al., Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity), is in some sense preliminary. But the simple fact that the computer synthesized sound within milliseconds, at the speed of thought and action, is evidence of remarkable progress: brain-computer interfaces are reaching a level at which they could give a voice to people who cannot speak.
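To make the closed-loop idea concrete, here is a minimal sketch of how such a streaming pipeline might be organized: short windows of neural features arrive continuously, each window is decoded into a chunk of audio, and the chunk is played back at once so the user hears the result within tens of milliseconds. Everything below (the window size, the feature count, the toy linear decoder) is an illustrative assumption, not the code from the study.

```python
import numpy as np

# Hypothetical settings: 10 ms windows of neural features, audio at 16 kHz.
WINDOW_MS = 10
AUDIO_RATE = 16_000
N_FEATURES = 128                                   # e.g. one band-power value per channel (assumption)
SAMPLES_PER_WINDOW = AUDIO_RATE * WINDOW_MS // 1000

def neural_feature_stream(n_windows=100):
    """Stand-in for the implant: yields one feature vector per 10 ms window."""
    rng = np.random.default_rng(0)
    for _ in range(n_windows):
        yield rng.normal(size=N_FEATURES)

def decode_window(features, weights):
    """Toy decoder: a linear map from neural features to one audio frame.
    A real system would use a trained model here."""
    return np.tanh(weights @ features)             # bounded 'waveform' samples

def closed_loop(stream, weights):
    """Decode each window as it arrives and hand the audio straight to playback,
    so the user gets feedback while still 'speaking'."""
    audio_out = []
    for features in stream:
        frame = decode_window(features, weights)
        audio_out.append(frame)                    # in practice: write to the sound card immediately
    return np.concatenate(audio_out)

weights = np.random.default_rng(1).normal(size=(SAMPLES_PER_WINDOW, N_FEATURES)) * 0.01
audio = closed_loop(neural_feature_stream(), weights)
print(f"synthesized {audio.size / AUDIO_RATE:.2f} s of audio in {WINDOW_MS} ms chunks")
```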

This inability, a consequence of a neurological disorder or brain damage, is called anarthria. It is terrible and exhausting, but there are ways to fight it. People with anarthria, that is, people who cannot speak on their own, can use devices that translate the movements of other parts of the body into letters or words; even blinking will do. Recently, a BCI implanted in the cerebral cortex of a subject with locked-in syndrome made it possible to turn imagined handwriting into actual text at 90 characters per minute. That is good, but not great: the average conversational speed in English is about 150 words per minute.
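For scale, a rough back-of-the-envelope conversion (assuming about five characters per English word, a common rule of thumb) shows how far 90 characters per minute is from conversational speed:

```python
chars_per_min = 90            # handwriting BCI, characters per minute
avg_word_len = 5              # rough assumption: ~5 characters per English word
conversational_wpm = 150      # typical conversational rate cited above

handwriting_wpm = chars_per_min / avg_word_len
print(f"handwriting BCI: ~{handwriting_wpm:.0f} words/min")
print(f"conversation:    ~{conversational_wpm} words/min "
      f"({conversational_wpm / handwriting_wpm:.1f}x faster)")
```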

Unfortunately, forming and producing speech is very hard (as is, say, controlling a cursor or a robotic arm by thought). Success depends on feedback, on the roughly 50-millisecond loop between the moment we say something and the moment we hear it. That feedback loop lets people monitor the quality of their own speech in real time. It is also how people learn to speak in the first place: make a sound, hear it (with the ears, the auditory cortex and other parts of the brain), and compare what came out with what was intended.

The problem is that even the best BCIs and computers need far more time than that to go from registering a brain signal to producing a sound. The team working with the Dutch-speaking woman managed to cut this time to 30 milliseconds, but the sounds their system produced were unintelligible; no words could be made out in them. If that can be fixed, then, in theory, a 30-millisecond interval is short enough to provide the feedback that would let a user, practicing on such a system, learn to use it better over time. "We have a very small data set, only a hundred words, and besides, we had very little time to conduct the experiment, so we could not give her [the study subject] the opportunity to practice properly," says Christian Herff, a computer scientist at Maastricht University and one of the lead authors of the paper mentioned above. "We just wanted to show that by practicing on audible speech, you can get something from imagined speech."
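Those 30 and 50 millisecond figures can be treated as a latency budget: every decoding step, including audio output, has to finish before the feedback window closes. A hypothetical way to check any decoder against that budget (the toy decoder and sizes below are assumptions):

```python
import time
import numpy as np

FEEDBACK_BUDGET_MS = 50   # roughly when auditory feedback stops feeling 'live'

def measure_latency(decode_fn, features, n_trials=200):
    """Time a single decode step repeatedly and report median and worst case;
    one slow chunk is enough to break the feedback loop."""
    timings = []
    for _ in range(n_trials):
        start = time.perf_counter()
        decode_fn(features)
        timings.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(timings)), max(timings)

# Toy decoder standing in for a trained model (assumption, not the study's code).
weights = np.random.default_rng(0).normal(size=(160, 128)) * 0.01
decode_fn = lambda x: np.tanh(weights @ x)
features = np.random.default_rng(1).normal(size=128)

median_ms, worst_ms = measure_latency(decode_fn, features)
verdict = "ok" if worst_ms < FEEDBACK_BUDGET_MS else "too slow"
print(f"median {median_ms:.3f} ms, worst {worst_ms:.3f} ms, budget {FEEDBACK_BUDGET_MS} ms -> {verdict}")
```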

Neurophysiologists have been working on reading speech signals from the human brain for at least 20 years. As they learn more about how speech originates, researchers use electrodes and imaging to scan brain activity during the act of speaking. Step by step they move forward, collecting data that can be turned into vowels and consonants. But it is not easy. "Imagined speech in particular is very difficult to study and very difficult to make sense of," says Ciaran Cooney of Ulster University, who studies BCIs and speech synthesis. "There are interesting discussions going on here, because if we are going to validate imagined speech against explicit speech, we need to know how closely the two are related."

Signals from the parts of the brain responsible for speech production, above all the inferior frontal gyrus, are especially hard to interpret. (You would reach it if you pushed a knitting needle through the skull just above the temple. Don't try that.) Imagined speech is not simply your thinking mind or your inner monologue; it is probably closer to what you hear in your head while deciding what to say. The way the brain does this may differ, syntactically, phonologically and rhythmically, from what actually comes out of your mouth. Different people may also encode information in the speech areas of the brain in different ways. Moreover, before the mouth does any work, everything the language-related parts of the brain have worked out must travel to the premotor and motor cortex, which control physical movements. And if you are building a system for people who cannot speak, they cannot tell you in their own words what it should be like, or confirm that it synthesizes exactly what they want to say, yet every BCI platform for speech synthesis requires precisely that kind of confirmation and training. "Studying imagined speech is a serious problem because we don't have observable data," says Herff.

In 2019, a team at the University of California, San Francisco (UCSF) found an elegant workaround. The scientists asked study subjects to speak and recorded signals not only from the parts of the brain responsible for producing words (the inferior frontal gyrus) but also from the regions controlling the movements of the mouth, tongue and jaw. This is the ventral sensorimotor cortex, which sits above the spot where you did not poke that knitting needle. The team built a machine learning system that turned those signals into a virtual version of the mechanical movements of speech and then synthesized intelligible words, though not in real time. This kind of approach is called an open-loop system.
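The two-stage idea, neural activity mapped first to articulator movements and only then to sound, can be sketched as two chained regressors. Everything below (the shapes, the synthetic data, the ridge models standing in for the real networks, the mel frames handed to an offline vocoder) is an illustrative assumption, not the UCSF implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Stage 1: ventral sensorimotor activity -> articulator kinematics
#          (jaw, lip, tongue trajectories; 12 hypothetical channels).
# Stage 2: kinematics -> acoustic features (mel frames) that a vocoder
#          could later turn into audio offline, i.e. an open-loop system.
neural = rng.normal(size=(5000, 128))                   # time x electrodes (synthetic)
kinematics = neural @ rng.normal(size=(128, 12)) * 0.1  # synthetic 'ground truth' movements
acoustics = kinematics @ rng.normal(size=(12, 80))      # synthetic mel frames

stage1 = Ridge(alpha=1.0).fit(neural, kinematics)
stage2 = Ridge(alpha=1.0).fit(stage1.predict(neural), acoustics)

# At use time: chain the two stages, then hand the mel frames to a vocoder.
predicted_mel = stage2.predict(stage1.predict(neural[:100]))
print(predicted_mel.shape)   # (100, 80) mel frames ready for offline synthesis
```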

That team, led by UCSF neuroscientist Eddie Chang, competes with the group working with the Dutch-speaking woman and receives funding from the company we have not yet lost the habit of calling Facebook. This year Chang and his colleagues published the sensational results of another study: in July they described connecting electrodes to the speech centers and adjacent areas of the cerebral cortex of a subject who had lost the ability to speak after a stroke. After about a year and a half of training with this person, they had a system capable of catching the intention to say any one of fifty words. Using an algorithm that predicts the most likely word sequence, together with a speech synthesizer, the system let the subject produce sentences of up to eight words at roughly 12 words per minute. It was the first practical test of such a system by a person with anarthria. The speech was not yet synthesized in real time, but more powerful computers will speed things up. "We were able to use his [the subject's] unintelligible, almost silent signals to produce and decode speech within the language we created," says Gopala Anumanchipalli, a computer scientist and neuroengineer at UCSF and UC Berkeley who took part in the study. "And we are already working toward letting the subject speak in real time."
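A heavily simplified sketch of that kind of pipeline: a classifier scores each attempted word against a fixed vocabulary, and a language model re-weights the scores so that probable word sequences win. The vocabulary, classifier probabilities and bigram table below are invented for illustration (and far smaller than the study's 50 words):

```python
import numpy as np

VOCAB = ["i", "am", "thirsty", "you", "are", "good"]   # stand-in for the 50-word set

def decode_sentence(classifier_probs, bigram):
    """Greedy decoding: combine per-word classifier scores with a bigram
    language model, a stand-in for the study's sequence prediction."""
    sentence, prev = [], "<s>"
    for probs in classifier_probs:                     # one row per attempted word
        scores = probs * np.array([bigram.get((prev, w), 1e-3) for w in VOCAB])
        prev = VOCAB[int(np.argmax(scores))]
        sentence.append(prev)
    return " ".join(sentence)

# Fake classifier output for three attempted words (each row sums to 1).
classifier_probs = np.array([
    [0.40, 0.10, 0.10, 0.30, 0.05, 0.05],   # 'i' vs 'you' is ambiguous on its own
    [0.20, 0.40, 0.10, 0.10, 0.15, 0.05],
    [0.10, 0.10, 0.50, 0.10, 0.10, 0.10],
])
bigram = {("<s>", "i"): 0.5, ("i", "am"): 0.6, ("am", "thirsty"): 0.7}
print(decode_sentence(classifier_probs, bigram))       # -> "i am thirsty"
```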

This approach, which restricts the lexicon to 50 words and thereby made the output of Chang's team more accurate and intelligible, has drawbacks. Since there is no feedback loop, the subject cannot correct the computer when it picks the wrong word. And it took 81 weeks for the subject to learn to use the 50-word lexicon built for him; imagine how long it would take with a lexicon of 1,000 words. "The more words you add to such a system, the less practical it becomes," says Frank Guenther, a speech neuroscientist at Boston University who did not work on the project. "Once you move to 100 words, decoding each word becomes so hard and the number of possible combinations grows so much that making reliable predictions gets very difficult. Meanwhile, most people's vocabulary contains not 50 but thousands of words."
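The scaling argument is easy to put in numbers: chance accuracy drops and the number of word pairs a decoder has to keep apart grows quickly with vocabulary size. The figures below are just that combinatorial count, not measured decoding accuracy:

```python
# How the word-classification problem grows with vocabulary size.
for n_words in (50, 100, 1000):
    chance = 1 / n_words                         # accuracy of a random guess
    confusable_pairs = n_words * (n_words - 1) // 2
    print(f"{n_words:>5} words: chance accuracy {chance:.1%}, "
          f"{confusable_pairs:,} word pairs to tell apart")
```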

The reason scientists, the Herff group in particular, want to build a closed-loop system that works in real time is that they aim to give users the ability to produce sounds rather than words. Phonemes such as "oh" or "hh", or even syllables and vowels, are the atomic units of speech. Build a library of neural correlates for them that the machine can recognize, and the user can produce as many different words as they like. In theory, at least. Guenther was part of a team that in 2009 used a BCI implanted in the motor cortex of a subject with locked-in syndrome to let him produce vowel sounds (though not full words) with a delay of only 50 milliseconds, and with quite decent accuracy from the start. "The idea behind the closed-loop system was simply to create the acoustic capability to produce any sound," says Guenther. "On the other hand, a 50-word system will be much better than the current one if it works very reliably, and Chang's team is much closer to solving the problem of reliable decoding than anyone else."
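With a phoneme-based closed-loop system, the decoder only has to tell a few dozen sound units apart, and words come from stringing them together, for example through a pronunciation dictionary. A minimal sketch under those assumptions (the decoded labels and the tiny dictionary are illustrative, not the output of any real decoder):

```python
# Hypothetical output of a phoneme-level decoder: one label per decoded unit.
decoded_phonemes = ["HH", "EH", "L", "OW"]

# A tiny stand-in for a pronunciation dictionary (CMUdict-style entries).
PRONUNCIATIONS = {
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def phonemes_to_word(phonemes):
    """Look the decoded phoneme string up in the dictionary; with only ~40
    phoneme classes to decode, any word in the dictionary becomes reachable."""
    return PRONUNCIATIONS.get(tuple(phonemes), "<unknown>")

print(phonemes_to_word(decoded_phonemes))   # -> "hello"
```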

Solving that problem, which will probably take about five years, means combining accuracy and intelligibility with the production of speech sounds in real time. "That is the general direction; all the groups are trying to do this, and to do it in real time," says Anumanchipalli.

Larger and more sensitive electrode arrays could help here; Meta, the former Facebook, and Elon Musk's Neuralink are working on them. Extracting more data from the speech areas of the brain could make it easier to produce intelligible synthetic phonemes in real time, and to answer the question of whether different people's brains work in roughly the same way. If they do, teaching subjects to use a BCI will be fairly simple, since every system will start from the same baseline; learning would then resemble watching whether a cursor moves correctly and, through biofeedback processes that are still poorly understood, developing more effective and reliable ways of acting.

If they do not, the research priority will be better algorithms for understanding and predicting what the brain is trying to do. Placing purpose-built electrode arrays in the neurosurgically optimal locations would be ideal, but current research-ethics rules get in the way. "It's very hard to do that in Europe," says Herff, "so for now we are focusing on a more sophisticated algorithm that lets us significantly improve speech quality, and on questions of learning."

For Anumanchipalli's group, that is the main goal. Today's BCIs approved for use in human subjects have fewer electrodes than scientists need to build a complete picture, though many hope that future technologies such as Neuralink will improve the situation. "We will no doubt always be limited in what we can record from the brain," Anumanchipalli stresses. "Whatever those limits are, we must be ready to compensate algorithmically for the shortage of data they create." That means thinking about how best to gather useful information, "how to create a protocol that is ideal for the subject to learn the system, and for the system to learn the subject." Besides data from the brain, a future speech synthesizer could take all sorts of other biometric streams as input; it could use, says Anumanchipalli, signals of intention or desire such as movement or even heart rate. And any new system should be designed to be easy to master and use, so that the user does not abandon it out of fatigue or frustration. "In my view, it's not far off," says Anumanchipalli. "The working principles have already been identified and validated. Progress is slow, but I think the approach we are refining is the right one." One day, it seems, imagined speech will stop being merely imagined.

Portal "Eternal youth" http://vechnayamolodost.ru

