Ever since the release of ChatGPT, generative artificial intelligence has been inspiring people all over the world. However, the technology’s benefits go far beyond text-based chatbots. In the future, generative AI will also help with automatic announcements on trains or in industry, using natural-sounding spoken language. The Audio and Media Technologies division at Fraunhofer IIS is driving this form of generative AI forward in a number of projects.
It happens every day in open-plan offices the world over: on screen, you can see your colleagues discussing an important topic. But behind them, other employees are holding meetings of their own, so you end up hearing more of the background chatter than of the discussion you are trying to follow. In the future, though, once generative artificial intelligence finds its way into laptops, smartphones, and the like, this scene will be a thing of the past. Conversations happening simultaneously in the background will be almost entirely filtered out, thanks to upHear Target Speaker Extraction.
To date, this has been achieved using traditional, or discriminative, AI: the model needs only a few seconds of training data to generate a digital fingerprint of someone’s voice. That fingerprint is then used to amplify the speaker’s voice and block out background conversations. “Thanks to AI methods, this already works very well,” says Jan Plogsties, Strategy Manager Generative AI at Fraunhofer IIS. The institute developed the technology as part of its many years of work on solutions to improve audio quality. Indeed, various products in the Fraunhofer IIS upHear family make use of AI, from smart speakers and smartphones to microphones for conference calls.
Generative AI will make the technology even more efficient to operate in the future. For instance, it could optimize the quality of the spoken word even in the presence of extremely loud interference, such as ventilation systems, vacuum cleaners, or street noise.
Generative AI differs from discriminative AI in that it can generate entirely new content. That includes not only text but also images, videos, and audio. Having seen huge amounts of data during training, the models can generate highly plausible new content on the basis of very little information. This is the decisive advantage over traditional AI algorithms.