Publication: VocalSep: High-Resolution Target Speaker Extraction
Abstract
Target Speaker Extraction (TSE) is the task of isolating an individual speaker from an auditory scene composed of a mixture of multiple speakers and environmental noise. Recent models in the larger field of audio source separation have achieved impressive performance using convolutional neural networks. These models vary in their use cases, from isolating individual instruments in music to more general-purpose models capable of separating sources based on a language query (a text description). Impressive performance has also been achieved by recent “voice encoder” models, which produce useful representations of the characteristics of a speaker’s voice. This thesis combines the methods of recent audio source separation and voice encoder models to isolate individual voices from complex auditory scenes containing multiple speakers and environmental noise. While previous TSE models have succeeded in extracting individual voices from an auditory scene, they operate only on low-sample-rate audio that captures less than half the human-audible frequency range. In this work, I introduce VocalSep, a high-resolution TSE model that uses a short audio prompt of a target speaker to recognize and extract their voice from noisy audio mixtures containing multiple speakers.
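The conditioning idea described above — encoding a short enrollment prompt into a speaker embedding that steers the separation of a mixture — can be illustrated with a toy sketch. This is a hypothetical, minimal stand-in (plain Python, frame-level cosine-similarity masking), not the VocalSep architecture; all function names and the masking scheme here are illustrative assumptions.

```python
import math

# Toy sketch of prompt-conditioned extraction (hypothetical; not VocalSep):
# an enrollment clip is encoded into one fixed speaker embedding, which
# then weights each frame of the mixture by its similarity to that voice.

def voice_encoder(frames):
    """Stand-in voice encoder: average feature frames into one embedding."""
    dim = len(frames[0])
    emb = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in emb)) or 1.0
    return [x / norm for x in emb]  # L2-normalized embedding

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def extract_target(mixture_frames, enroll_frames):
    """Soft-mask each mixture frame by similarity to the speaker embedding."""
    emb = voice_encoder(enroll_frames)
    out = []
    for frame in mixture_frames:
        # sigmoid of similarity -> mask value in (0, 1)
        mask = 1.0 / (1.0 + math.exp(-5.0 * cosine(emb, frame)))
        out.append([mask * x for x in frame])
    return out

# 8-dimensional feature frames: a 20-frame mixture, a 5-frame enrollment clip
mixture = [[abs(math.sin(t * d + d)) for d in range(8)] for t in range(20)]
enroll = [[abs(math.cos(t * d)) for d in range(8)] for t in range(5)]
est = extract_target(mixture, enroll)
print(len(est), len(est[0]))  # same frame count and dimension as the mixture
```

A real system would replace the averaging encoder with a learned voice encoder and the similarity mask with a separation network, but the data flow — prompt in, embedding out, embedding conditions the mask over the mixture — follows the same shape.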