Webinar 10: Silent Speech Interfaces & Human-Computer Interface (HCI) / Machine Learning (ML) aspects. (Apr. 28, 2022)
Silent speech interfaces (SSIs) are a revolutionary field of speech technology: the main idea is to record the articulatory movements and automatically generate speech from this movement information while the subject produces no sound. This research area, also known as articulatory-to-acoustic mapping (AAM), has large potential impact in a number of domains. It could be highly useful for the speech-impaired (e.g., after laryngectomy), and for scenarios where regular speech is not feasible but information must still be transmitted by the speaker (e.g., extremely noisy environments; military applications). Voice assistants have become popular lately, but they are still not in every home. One of the reasons is privacy concerns: some people do not feel comfortable speaking aloud with others around – but an SSI device could be a solution to that.
There are two distinct types of SSI solutions, namely 'direct synthesis' and 'recognition-and-synthesis'. In the first case, the speech signal is generated directly from the articulatory data, without an intermediate step. In the second case, silent speech recognition (SSR) is applied to the biosignal to extract the content spoken by the person (i.e., the result of this step is text); this step is then followed by text-to-speech (TTS) synthesis. In the SSR+TTS approach, any information related to speech prosody (intonation and durations) is lost, whereas it may be preserved with direct synthesis. In addition, the smaller delay of the direct synthesis approach might enable conversational use; therefore, we have been following this approach in our project.
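The contrast between the two architectures can be sketched as follows. This is a minimal, purely illustrative outline – the function names and the stub return values are assumptions, not the actual system; its only purpose is to show where the prosody information survives in each pipeline.

```python
# Illustrative sketch of the two SSI architectures (stub functions only).
# The biosignal is represented abstractly; real systems would use neural
# networks at each stage.

def direct_synthesis(biosignal: list) -> dict:
    """Direct synthesis: biosignal -> speech in one step.

    Because the waveform is generated directly from the articulatory
    data, prosodic information (intonation, durations) can be kept.
    """
    return {"speech": "waveform", "prosody_preserved": True}

def recognition_and_synthesis(biosignal: list) -> dict:
    """SSR + TTS: biosignal -> text -> speech.

    The intermediate text step discards the speaker's original prosody;
    the TTS stage can only re-generate a generic prosody from the text.
    """
    text = "recognized words"            # silent speech recognition (SSR)
    return {"speech": f"tts({text})", "prosody_preserved": False}
```

The extra recognition step is also what adds latency to the SSR+TTS pipeline, which is why direct synthesis is the more promising candidate for conversational use.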
To fulfill the above goals, we formed a multidisciplinary team of expert senior researchers in speech synthesis, speech recognition, deep learning, and articulatory data acquisition. As biosignals, 2D ultrasound, lip video, and magnetic resonance imaging were used to image the motion of the speech organs. In our experiments, we used standard deep learning approaches (convolutional and recurrent neural networks, autoencoders) as well as novel machine learning methods with high potential (adversarial training, neural vocoders, and cross-speaker experiments). When designing ML/DL approaches, it was not enough to evaluate the system with objective measures (e.g., validation loss); it was also important to keep the human aspects in mind. Therefore, after each deep learning experiment, we evaluated the resulting synthesized speech samples in subjective listening tests with potential users. Such an SSI system, capable of converting the silent articulation of any person into fully natural audible speech, is not yet available, but we have made significant progress towards practical prototypes.
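As a rough illustration of what a direct articulatory-to-acoustic mapping model computes, the sketch below maps one flattened ultrasound frame to one frame of spectral features with a tiny fully connected network. All sizes, names, and the random (untrained) weights are assumptions for illustration only; the project's actual models (CNNs, RNNs, autoencoders) are far larger and are trained on parallel ultrasound–speech recordings.

```python
import numpy as np

# Hypothetical minimal sketch of direct articulatory-to-acoustic mapping:
# a tiny fully connected network maps a flattened 2D ultrasound frame to
# one frame of spectral features (e.g. 80 mel bands), which a vocoder
# would then turn into a waveform. Weights are random stand-ins for a
# trained model; all dimensions are illustrative.

rng = np.random.default_rng(0)

N_PIXELS = 64 * 64    # flattened ultrasound frame (illustrative size)
N_HIDDEN = 128        # hidden layer width
N_SPECTRAL = 80       # spectral features per output frame

W1 = rng.normal(0.0, 0.01, (N_PIXELS, N_HIDDEN))
W2 = rng.normal(0.0, 0.01, (N_HIDDEN, N_SPECTRAL))

def articulatory_to_acoustic(frame: np.ndarray) -> np.ndarray:
    """Map one flattened ultrasound frame to one spectral frame."""
    hidden = np.maximum(frame @ W1, 0.0)  # ReLU hidden layer
    return hidden @ W2                    # linear spectral output

frame = rng.random(N_PIXELS)              # one dummy ultrasound frame
spectral = articulatory_to_acoustic(frame)
print(spectral.shape)  # (80,)
```

In a real system this frame-by-frame mapping would be trained with a regression loss against recorded speech features, and the objective validation loss would then be complemented by the subjective listening tests mentioned above.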
Watch on LinkedIn: https://www.linkedin.com/video/event/urn:li:ugcPost:6924797758058999808/