John R. Hershey, Steven J. Rennie, Peder A. Olsen, Trausti T. Kristjansson
Abstract: We present a system that can separate and recognize the simultaneous speech of two people recorded in a single channel. Applied to the monaural speech separation and recognition challenge, the system outperformed all other participants, including human listeners, with an overall recognition error rate of 21.6%, compared to the human error rate of 22.3%. The system consists of a speaker recognizer, a model-based speech separation module, and a speech recognizer. For the separation models we explored a range of speech models that incorporate different levels of constraints on temporal dynamics to help infer the source speech signals. The system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task. For inference, we compare a 2-D Viterbi algorithm and two loopy belief-propagation algorithms. We show how belief propagation reduces the complexity of temporal inference from exponential to linear in the number of sources and the size of the language model. The best belief-propagation method results in nearly the same recognition error rate as exact inference.
Please cite this article in press as: Hershey, J.R. et al., Super-human multi-talker speech recognition: …, Computer Speech and Language (2009), doi:10.1016/j.csl.2008.11.001