T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath
Download the ICSLP 2006 Paper Here
Audio examples:
The Speech Separation Challenge (SSC) task involves two people speaking at the same time. The goal is to identify the letter and number spoken by the speaker who says “white”. Our system first extracts the two component sentences and then sends them to a speech recognition system.
Listen to the mixed examples and see if you can do it!
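Once the two sentences have been separated and recognized, picking the answer is a small text-matching step. The sketch below is only an illustration of that last step, not the system's code: it assumes the six-word layout seen in the transcripts on this page (verb, colour, preposition, letter, digit, adverb), and the names `parse` and `target_answer` are hypothetical.

```python
import re

# Assumed six-word sentence layout, inferred from the transcripts on this page:
# verb, colour, preposition, letter, digit, adverb.
PATTERN = re.compile(r"^\w+\s+(\w+)\s+\w+\s+([A-Za-z])\s+(\d)\s+\w+$")


def parse(sentence):
    """Return (colour, letter, digit), or None if the sentence does not fit."""
    m = PATTERN.match(sentence.strip())
    if m is None:
        return None
    colour, letter, digit = m.groups()
    return colour.lower(), letter.upper(), int(digit)


def target_answer(transcripts, keyword="white"):
    """Find the transcript whose colour is the keyword; return its letter and digit."""
    for text in transcripts:
        parsed = parse(text)
        if parsed is not None and parsed[0] == keyword:
            return parsed[1], parsed[2]
    return None


# The two recognized sentences from the first example on this page.
print(target_answer(["Lay green in I 3 soon", "Bin white in D 1 again"]))
```

If neither transcript contains the keyword, `target_answer` returns `None` rather than guessing.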
Same female speaker, target speaker at -6 dB:
Play Input Signal: | Try it! (hint: listen for “white”)
Play Separated Target: | Bin white in D 1 again
Play Separated Masker: | Lay green in I 3 soon
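The level in each heading is read here as the target's energy relative to the masker, in decibels. That reading is an assumption, as is the helper name `mix_at_level`. Under that assumption, a mixture at a chosen level can be sketched as follows:

```python
import math


def mix_at_level(target, masker, level_db):
    """Scale the target so its energy sits level_db above the masker, then add.

    Assumes both signals are equal-length lists of samples.
    Returns the mixture and the gain applied to the target.
    """
    if len(target) != len(masker):
        raise ValueError("signals must have the same length")
    e_target = sum(x * x for x in target)
    e_masker = sum(x * x for x in masker)
    if e_target == 0 or e_masker == 0:
        raise ValueError("both signals must have non-zero energy")
    # Choose the gain so that gain^2 * e_target / e_masker == 10^(level_db / 10).
    gain = math.sqrt(e_masker / e_target * 10 ** (level_db / 10))
    mixture = [gain * t + m for t, m in zip(target, masker)]
    return mixture, gain


# Toy samples, not real speech.
target = [1.0, -1.0, 1.0, -1.0]
masker = [2.0, 2.0, -2.0, -2.0]
mixture, gain = mix_at_level(target, masker, -6.0)
```

At 0 dB the scaled target and the masker carry equal energy; at -6 dB the target carries roughly a quarter of the masker's energy.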
Same male speaker, target speaker at 0 dB:
Play Input Signal: | Try it! (hint: listen for “white”)
Play Separated Target: | Set white at H 6 now
Play Separated Masker: | Bin blue with R 4 please
Same female speaker, target speaker at 0 dB, swapped:
In many cases, word boundaries in the two sentences are closely aligned, making the task (almost) impossible. In these cases, our system will arbitrarily stitch together sentence fragments.
Play Input Signal: | Try it! (hint: it may be impossible)
Play Separated Target: | Lay white at N 4 please (should be: Lay white at I 5 again)
Play Separated Masker: | Place green in I 5 again (should be: Place green in N 4 please)
Dynamics:
The following examples show the importance of temporal constraints. Grammar dynamics force the separated signals to contain whole words. Acoustic dynamics only enforce short-time smoothness.
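To give a feel for what a short-time smoothness constraint does, the toy sketch below compares an independent per-frame decision with a Viterbi decode under a "sticky" transition model. It is not the system's model: the states, the scores, the `stay_prob` parameter, and the function names are all hypothetical, and a word-level grammar constraint is not shown.

```python
import math


def framewise(scores):
    """No dynamics: pick the best-scoring state in each frame independently."""
    return [max(range(len(f)), key=lambda s: f[s]) for f in scores]


def viterbi(scores, stay_prob=0.9):
    """Smoothness prior: a state persists from frame to frame with stay_prob."""
    n = len(scores[0])
    stay = math.log(stay_prob)
    switch = math.log((1.0 - stay_prob) / (n - 1))
    best = list(scores[0])  # uniform prior over the first state (constant dropped)
    back = []
    for frame in scores[1:]:
        pointers, updated = [], []
        for s in range(n):
            # Score of arriving in state s from each possible previous state p.
            cands = [best[p] + (stay if p == s else switch) for p in range(n)]
            p_best = max(range(n), key=lambda p: cands[p])
            pointers.append(p_best)
            updated.append(cands[p_best] + frame[s])
        back.append(pointers)
        best = updated
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy per-frame log-scores for two states; frame 3 weakly prefers state 1.
scores = [[-1.0, -2.0], [-1.0, -2.0], [-2.0, -1.5], [-1.0, -2.0], [-1.0, -2.0]]
print(framewise(scores))
print(viterbi(scores))
```

In this toy input the per-frame decision flips to state 1 for the single ambiguous frame, while the smoothed decode stays in state 0 throughout, because two switches cost more than the weak local evidence gains.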
Same Talker Condition
Notice the dramatic improvement when grammar constraints are used in this condition.
Level, dynamics | Input Signal | Separated Target | Separated Masker
-6 dB, No dynamics: | play | play | play
-6 dB, Acoustic dynamics: | play | play | play
-6 dB, Grammar + acoustic dynamics: | play | play | play
Different Talker Condition
In this condition, recordings of two male talkers or two female talkers are mixed together.
Level, dynamics | Input Signal | Separated Target | Separated Masker
-3 dB, No dynamics: | play | play | play
-3 dB, Acoustic dynamics: | play | play | play
-3 dB, Grammar + acoustic dynamics: | play | play | play
Different Gender Condition
In this condition, recordings of male and female talkers are mixed together. Dynamics help to smooth the result, but the short-time signatures of the voices suffice to separate them.
Level, dynamics | Input Signal | Separated Target | Separated Masker
0 dB, No dynamics: | play | play | play
0 dB, Acoustic dynamics: | play | play | play
0 dB, Grammar + acoustic dynamics: | play | play | play
Results:
Our system achieved super-human performance in multiple conditions of the 2006 ICSLP Speech Separation Challenge.
References:
T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath,
Super-Human Multi-Talker Speech Recognition: The IBM 2006 Speech Separation Challenge System
Submitted to ICSLP 2006
J. Hershey and M. Casey,
Audio-Visual Sound Separation Via Hidden Markov Models,
in Advances in Neural Information Processing Systems 14, 2002
T. Kristjansson, J. Hershey, H. Attias,
Single Microphone Source Separation using High Resolution Signal Reconstruction,
IEEE International Conference on Acoustics, Speech and Signal Processing, 2004
B.J. Frey, T. Kristjansson, L. Deng, A. Acero,
Learning dynamic noise models from noisy speech for robust speech recognition,
in Advances in Neural Information Processing Systems (NIPS), 2001
P. Olsen and S. Dharanipragada,
An efficient integrated gender detection scheme and time mediated averaging of gender dependent acoustic models,
in Proceedings of Eurospeech, 2003