Audio Examples: Super-Human Source Separation

T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath

Download the ICSLP 2006 Paper Here

Audio examples:

The SSC (Speech Separation Challenge) task involves two people speaking at the same time. The goal is to identify the letter and number spoken by the talker who says “white”. Our system first extracts the two component sentences and then sends them to a speech recognition system.
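As a rough illustration of the final step of the task, the sketch below picks the target sentence out of two recognized transcripts and reads off its letter and number. The sentence layout (command, color, preposition, letter, digit, adverb) is inferred from the transcripts shown on this page, and the function names are hypothetical; this is not the recognizer used in our system.

```python
import re

# Assumed sentence layout, inferred from the transcripts on this page:
# <command> <color> <preposition> <letter> <digit> <adverb>
PATTERN = re.compile(r"^\w+ (\w+) \w+ ([A-Z]) (\d) \w+$")


def parse(transcript):
    """Return (color, letter, digit), or None if the layout does not match."""
    match = PATTERN.match(transcript.strip())
    return match.groups() if match else None


def target_answer(transcripts, keyword="white"):
    """Return (letter, digit) from the first transcript whose color is the keyword."""
    for text in transcripts:
        parsed = parse(text)
        if parsed and parsed[0].lower() == keyword:
            return parsed[1], parsed[2]
    return None


# Transcripts taken from the first example on this page.
separated = ["Lay green in I 3 soon", "Bin white in D 1 again"]
print(target_answer(separated))  # ('D', '1')
```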

Listen to the mixed examples and see if you can do it!

Same female speaker, target speaker at -6 dB:

Play Input Signal:
Try it! (hint: listen for “white”)
Play Separated Target:
Bin white in D 1 again
Play Separated Masker:
Lay green in I 3 soon

Same male speaker, target speaker at 0 dB:

Play Input Signal:
Try it! (hint: listen for “white”)
Play Separated Target:
Set white at H 6 now
Play Separated Masker:
Bin blue with R 4 please

Same female speaker, target speaker at 0 dB, swapped:

In many cases, word boundaries in the two sentences are closely aligned, making the task (almost) impossible. In these cases, our system will arbitrarily stitch together sentence fragments.

Play Input Signal:
Try it! (hint: it may be impossible)
Play Separated Target:
Lay white at N 4 please (should be: Lay white at I 5 again)
Play Separated Masker:
Place green in I 5 again (should be: Place green in N 4 please)

Dynamics:

The following examples show the importance of temporal constraints. Grammar dynamics force the separated signals to contain whole words. Acoustic dynamics only enforce short-time smoothness.
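The difference between decoding with no dynamics and decoding with short-time smoothness can be sketched with a toy example. The code below is not our model: it assumes a hypothetical two-state hidden variable, made-up per-frame log-likelihoods, and a "sticky" first-order transition matrix, and it uses a standard Viterbi search. Grammar dynamics, which constrain whole words, are not modeled here.

```python
import math


def framewise_decode(log_lik):
    """No dynamics: pick the best state for each frame independently."""
    return [max(range(len(frame)), key=lambda s: frame[s]) for frame in log_lik]


def viterbi_decode(log_lik, log_trans):
    """First-order dynamics: best state path under the transition scores."""
    n_states = len(log_lik[0])
    score = list(log_lik[0])  # uniform initial prior (constant term omitted)
    back = []
    for frame in log_lik[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame.
            prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(prev)
            new_score.append(score[prev] + log_trans[prev][s] + frame[s])
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical per-frame log-likelihoods for two states; frame 2 is an outlier.
log_lik = [[-1.0, -2.0], [-1.0, -2.0], [-2.0, -1.5], [-1.0, -2.0], [-1.0, -2.0]]
# "Sticky" transitions: staying in a state is far more likely than switching.
stay, switch = math.log(0.9), math.log(0.1)
log_trans = [[stay, switch], [switch, stay]]

print(framewise_decode(log_lik))           # [0, 0, 1, 0, 0]
print(viterbi_decode(log_lik, log_trans))  # [0, 0, 0, 0, 0]
```

In this toy setting the frame-by-frame decoder follows the outlier frame, while the smoothed decoder keeps a consistent state because a brief switch costs more than the outlier gains.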

Same Talker Condition

Notice the dramatic improvement when grammar constraints are used in this condition.

Condition                             Input Signal   Separated Target   Separated Masker
-6 dB, No dynamics                    play           play               play
-6 dB, Acoustic dynamics              play           play               play
-6 dB, Grammar + acoustic dynamics    play           play               play

Different Talker Condition

In this condition, recordings of two male talkers or two female talkers are mixed together.

Condition                             Input Signal   Separated Target   Separated Masker
-3 dB, No dynamics                    play           play               play
-3 dB, Acoustic dynamics              play           play               play
-3 dB, Grammar + acoustic dynamics    play           play               play
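The dB figures attached to each example can be read as the level of the target relative to the masker. The sketch below shows one way such a mixture could be built, assuming the figure is a target-to-masker ratio computed from RMS levels and that the target is the signal being scaled. The sine tones stand in for speech, and the function names are hypothetical; the page does not describe how the challenge mixtures were actually produced.

```python
import math


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def tmr_gain(target, masker, tmr_db):
    """Gain for the target so that its RMS is tmr_db dB relative to the masker."""
    return (rms(masker) / rms(target)) * 10 ** (tmr_db / 20)


def mix_at_tmr(target, masker, tmr_db):
    """Scale the target to the requested ratio and add it to the masker."""
    gain = tmr_gain(target, masker, tmr_db)
    return [gain * t + m for t, m in zip(target, masker)]


# Stand-in signals (assumed 8 kHz sample rate): two tones instead of two talkers.
rate = 8000
target = [math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
masker = [0.5 * math.sin(2 * math.pi * 330 * n / rate) for n in range(rate)]

mixture = mix_at_tmr(target, masker, -3.0)
scaled_target = [tmr_gain(target, masker, -3.0) * t for t in target]
print(round(20 * math.log10(rms(scaled_target) / rms(masker)), 3))  # -3.0
```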

Different Gender Condition

In this condition, recordings of male and female talkers are mixed together. Dynamics help to smooth the result, but the short-time signatures of the voices suffice to separate them.

Condition                             Input Signal   Separated Target   Separated Masker
0 dB, No dynamics                     play           play               play
0 dB, Acoustic dynamics               play           play               play
0 dB, Grammar + acoustic dynamics     play           play               play

Results:

Our system achieved super-human performance in multiple conditions of the 2006 ICSLP Speech Separation Challenge.

Bar chart with word error rates for different conditions.
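For readers unfamiliar with the metric, the sketch below computes a standard word error rate as the word-level edit distance divided by the number of reference words. It is a generic definition, not the challenge's scoring script, whose exact rules are not given on this page. The example sentences come from the swapped case above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (sub/ins/del) divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[-1][-1] / len(ref)


# Swapped example: three of the six reference words are wrong.
print(word_error_rate("Lay white at I 5 again", "Lay white at N 4 please"))  # 0.5
```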

References:

T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath,
Super-Human Multi-Talker Speech Recognition: The IBM 2006 Speech Separation Challenge System
Submitted to ICSLP 2006

J. Hershey and M. Casey,
Audio-Visual Sound Separation Via Hidden Markov Models,
in Advances in Neural Information Processing Systems 14, 2002

T. Kristjansson, J. Hershey, H. Attias,
Single Microphone Source Separation using High Resolution Signal Reconstruction,
IEEE International Conference on Acoustics, Speech and Signal Processing, 2004

B.J. Frey, T. Kristjansson, L. Deng, A. Acero,
Learning dynamic noise models from noisy speech for robust speech recognition,
in Advances in Neural Information Processing Systems (NIPS), 2001

P. Olsen and S. Dharanipragada,
An efficient integrated gender detection scheme and time mediated averaging of gender dependent acoustic models,
in Proceedings of Eurospeech 2003