T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath
We describe a system for model-based speech separation which achieves super-human recognition performance when two talkers speak at similar levels. The system can separate the speech of two speakers from a single-channel recording with remarkable results. It incorporates a novel method for performing two-talker speaker identification and gain estimation. We extend the method of model-based high-resolution signal reconstruction to incorporate temporal dynamics. We report on two methods for introducing dynamics: the first uses dynamics in the acoustic model space, and the second incorporates dynamics based on sentence grammar. The addition of temporal constraints leads to dramatic improvements in the separation performance. Once the signals have been separated, they are recognized using speaker-dependent labeling.