Publication: VocalSep: High-Resolution Target Speaker Extraction
Abstract
Target Speaker Extraction (TSE) is the task of isolating an individual speaker from an auditory scene composed of a mixture of multiple speakers and environmental noise. Recent models in the larger field of audio source separation have achieved impressive performance using convolutional neural networks. These models vary in their use cases, from isolating individual instruments in music to more general-purpose models capable of separating sources based on a language query (a text description). Impressive performance has also been achieved by recent “voice encoder” models, which produce useful representations of the characteristics of a speaker’s voice. This thesis combines the methods of recent audio source separation and voice encoder models to isolate individual voices from complex auditory scenes containing multiple speakers and environmental noise. While previous TSE models have succeeded in extracting individual voices from an auditory scene, they operate only on low-sample-rate audio that captures less than half the human-audible frequency range. In this work, I introduce VocalSep, a high-resolution TSE model that uses a short audio prompt of a target speaker to recognize and extract their voice from noisy audio mixtures containing multiple speakers.
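The conditioning idea described above — encoding a short enrollment prompt into a speaker embedding that steers the separation of a mixture — can be illustrated with a toy sketch. This is a hypothetical, minimal stand-in (plain Python, frame-level cosine-similarity masking), not the VocalSep architecture; all function names and the masking scheme here are illustrative assumptions.

```python
import math

# Toy sketch of prompt-conditioned extraction (hypothetical; not VocalSep):
# an enrollment clip is encoded into one fixed speaker embedding, which
# then weights each frame of the mixture by its similarity to that voice.

def voice_encoder(frames):
    """Stand-in voice encoder: average feature frames into one embedding."""
    dim = len(frames[0])
    emb = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in emb)) or 1.0
    return [x / norm for x in emb]  # L2-normalized embedding

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def extract_target(mixture_frames, enroll_frames):
    """Soft-mask each mixture frame by similarity to the speaker embedding."""
    emb = voice_encoder(enroll_frames)
    out = []
    for frame in mixture_frames:
        # sigmoid of similarity -> mask value in (0, 1)
        mask = 1.0 / (1.0 + math.exp(-5.0 * cosine(emb, frame)))
        out.append([mask * x for x in frame])
    return out

# 8-dimensional feature frames: a 20-frame mixture, a 5-frame enrollment clip
mixture = [[abs(math.sin(t * d + d)) for d in range(8)] for t in range(20)]
enroll = [[abs(math.cos(t * d)) for d in range(8)] for t in range(5)]
est = extract_target(mixture, enroll)
print(len(est), len(est[0]))  # same frame count and dimension as the mixture
```

A real system would replace the averaging encoder with a learned voice encoder and the similarity mask with a separation network, but the data flow — prompt in, embedding out, embedding conditions the mask over the mixture — follows the same shape.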