VocalSep: High-Resolution Target Speaker Extraction

Eggert, Sam

datacite.rights	restricted
dc.contributor.advisor	Finkelstein, Adam
dc.contributor.author	Eggert, Sam
dc.date.accessioned	2025-08-06T14:19:43Z
dc.date.available	2025-08-06T14:19:43Z
dc.date.issued	2025-04-10
dc.description.abstract	Target Speaker Extraction (TSE) is the task of isolating an individual speaker from an auditory scene composed of a mixture of multiple speakers and environmental noise. Recent models in the broader field of audio source separation have achieved impressive performance using convolutional neural networks. These models vary in their use cases, from isolating individual instruments in music to general-purpose models capable of separating sources based on a language query (a text description). Recent “voice encoder” models have also achieved impressive performance in creating useful representations of the characteristics of a speaker’s voice. This thesis combines the methods of recent audio source separation and voice encoder models to isolate individual voices from complex auditory scenes containing multiple speakers and environmental noise. While previous TSE models have succeeded in extracting individual voices from an auditory scene, they can only be used on low-sample-rate audio that captures frequencies covering less than half the human-audible range. In this work, I introduce VocalSep, a high-resolution TSE model that uses a short audio prompt of a target speaker to recognize and extract their voice from noisy audio mixtures containing multiple speakers.
dc.identifier.uri	https://theses-dissertations.princeton.edu/handle/88435/dsp015425kf140
dc.language.iso	en
dc.title	VocalSep: High-Resolution Target Speaker Extraction
dc.type	Princeton University Senior Theses
dspace.entity.type	Publication
dspace.workflow.startDateTime	2025-05-07T21:03:49.781Z
pu.contributor.authorid	920278178
pu.date.classyear	2025
pu.department	Computer Science
pu.minor	Statistics and Machine Learning

Files

Original bundle

Name: written_final_report.pdf
Size: 2.33 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 100 B
Format: Item-specific license agreed to upon submission

Collections

Computer Science, 1987-2025
