T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath,
Interspeech 2006, Winner of PASCAL Speech Separation Challenge
We describe a system for model-based speech separation that achieves super-human recognition performance when two talkers speak at similar levels. The system can separate the speech of two speakers from a single-channel recording with remarkable results. It incorporates a novel method for performing two-talker speaker identification and gain estimation. We extend the method of model-based high-resolution signal reconstruction to incorporate temporal dynamics. We report on two methods for introducing dynamics: the first uses dynamics in the acoustic model space; the second incorporates dynamics based on sentence grammar. The addition of temporal constraints leads to dramatic improvements in separation performance. Once the signals have been separated, they are recognized using speaker-dependent labeling.
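
To illustrate why temporal constraints can help, the following is a minimal toy sketch, not the system described above. It assumes a single chain of discrete acoustic states with made-up log-likelihoods and a hypothetical "sticky" transition matrix, and compares frame-independent state selection against Viterbi decoding. All names and numbers (`LOG_LIK`, `LOG_TRANS`, `framewise_decode`, `viterbi_decode`) are illustrative assumptions; the actual system models two talkers jointly and is considerably more involved.

```python
import math


def framewise_decode(log_lik):
    """Pick the most likely state in each frame, ignoring neighbouring frames."""
    return [max(range(len(frame)), key=lambda s: frame[s]) for frame in log_lik]


def viterbi_decode(log_lik, log_trans, log_prior):
    """Most likely state sequence under first-order temporal dynamics."""
    n_states = len(log_prior)
    # Best log-score of any path ending in each state at the first frame.
    score = [log_prior[s] + log_lik[0][s] for s in range(n_states)]
    backptrs = []
    for frame in log_lik[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor state for landing in state s at this frame.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
        score = new_score
        backptrs.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical per-frame log-likelihoods for two states; the middle frame is
# ambiguous and slightly favours state 1.
LOG_LIK = [
    [-1.0, -3.0],
    [-1.0, -3.0],
    [-2.0, -1.5],
    [-1.0, -3.0],
    [-1.0, -3.0],
]
# Hypothetical "sticky" dynamics: states tend to persist from frame to frame.
LOG_TRANS = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.1), math.log(0.9)],
]
LOG_PRIOR = [math.log(0.5), math.log(0.5)]

if __name__ == "__main__":
    print("frame-wise:", framewise_decode(LOG_LIK))
    print("viterbi:   ", viterbi_decode(LOG_LIK, LOG_TRANS, LOG_PRIOR))
```

In this toy setting the frame-wise decoder switches state on the ambiguous middle frame, while the Viterbi path stays in state 0 because two state changes cost more than the small likelihood gain. This is one simple way in which dynamics can regularise frame-level decisions.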