Thad Hughes, Trausti Kristjansson
We consider the task of speech recognition in the presence of loud background music. We use model-based music-speech separation and train GMMs for music on the audio that precedes the speech. We show over 8% relative improvement in WER at 10 dB SNR for a real-world Voice Search ASR system.
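To make the 10 dB SNR condition concrete, the following is a minimal sketch of how a speech signal and a music signal can be mixed at a target SNR. It is an illustration only: the function names `mix_at_snr` and `snr_db` are hypothetical, the signals are synthetic sine waves, and the actual data preparation used in this work is not described in this section.

```python
import math


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(speech, noise):
    """Speech-to-noise power ratio in decibels."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def mix_at_snr(speech, music, target_snr_db):
    """Scale `music` so that the speech-to-music ratio equals
    `target_snr_db`, then add it to `speech`.

    Returns the mixed signal and the gain applied to the music.
    """
    if len(speech) != len(music):
        raise ValueError("speech and music must have the same length")
    p_speech = mean_power(speech)
    p_music = mean_power(music)
    if p_music == 0.0:
        raise ValueError("music signal has zero power")
    # Solve 10*log10(p_speech / (gain^2 * p_music)) = target_snr_db for gain.
    gain = math.sqrt(p_speech / (p_music * 10.0 ** (target_snr_db / 10.0)))
    mixed = [s + gain * m for s, m in zip(speech, music)]
    return mixed, gain


# Synthetic stand-ins for real audio (assumption: any equal-length signals work).
speech = [math.sin(0.1 * n) for n in range(1000)]
music = [0.3 * math.sin(0.37 * n) for n in range(1000)]
mixed, gain = mix_at_snr(speech, music, 10.0)
```

At 10 dB SNR the speech carries ten times the power of the scaled music, which is the condition under which the WER improvement above is reported.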
We investigate the relationship between ASR accuracy, the amount of background music used as prologue, and the size of the music models.
Our study shows that performance peaks when a music prologue of around 6 seconds is used to train the music model. We hypothesize that this is due to the dynamic nature of music and the structure of popular music. Adding more history beyond a certain point does not improve results. Additionally, we show that moderately sized 8-component music GMMs suffice to model this amount of music prologue.
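As an illustration of fitting a small music model to a short prologue, the sketch below trains an 8-component diagonal-covariance GMM with EM on roughly 6 seconds of feature frames. Everything beyond the component count and the prologue length is an assumption made for the example: the frame rate of 100 frames per second, the 2-dimensional synthetic features, the diagonal covariances, the variance floor, and the function name `fit_diag_gmm` are all hypothetical and are not taken from the system described here.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_diag_gmm(frames, n_components=8, n_iter=10, var_floor=1e-3, seed=0):
    """Fit a diagonal GMM with EM; returns weights, means, variances and
    the average log-likelihood per frame measured at each E-step."""
    rng = random.Random(seed)
    n, dim = len(frames), len(frames[0])
    # Initialise means from random frames and variances from the global variance.
    means = [list(f) for f in rng.sample(frames, n_components)]
    g_mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    g_var = [max(sum((f[d] - g_mean[d]) ** 2 for f in frames) / n, var_floor)
             for d in range(dim)]
    variances = [list(g_var) for _ in range(n_components)]
    weights = [1.0 / n_components] * n_components
    history = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each frame.
        resp, total = [], 0.0
        for x in frames:
            logp = [math.log(w) + log_gauss_diag(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            norm = logsumexp(logp)
            total += norm
            resp.append([math.exp(lp - norm) for lp in logp])
        history.append(total / n)
        # M-step: re-estimate each component from its weighted frames.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            if nk < 1e-8:
                continue  # leave a (nearly) empty component unchanged
            means[k] = [sum(r[k] * x[d] for r, x in zip(resp, frames)) / nk
                        for d in range(dim)]
            variances[k] = [
                max(sum(r[k] * (x[d] - means[k][d]) ** 2
                        for r, x in zip(resp, frames)) / nk, var_floor)
                for d in range(dim)]
            weights[k] = nk / n
        w_sum = sum(weights)
        weights = [w / w_sum for w in weights]
    return weights, means, variances, history


# Hypothetical prologue: 6 s at an assumed 100 frames/s, 2-D synthetic features.
data_rng = random.Random(1)
centres = [(-4.0, 0.0), (0.0, 4.0), (4.0, 0.0)]
prologue = [[data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0)]
            for cx, cy in (centres[i % 3] for i in range(600))]
weights, means, variances, history = fit_diag_gmm(prologue, n_components=8)
```

With only a few hundred frames available, a small component count keeps the number of frames per component reasonable, which is consistent with the observation that 8 components suffice for a prologue of this length.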