Audio Examples: Super-Human Speech Separation

Here are audio samples for the world’s first (and only?) Super-Human Speech Separation System.

Click Here for 2006 Audio Examples page.


Shell scripts in Javascript.

I’m a big fan of javascript.

My first taste of javascript was during a bout of insomnia when I wrote a little night light sleeping aid that replaced a $50 gadget. Then came real time synthesis of chimes, then face recognition in the browser.

The one thing that I missed was being able to do shell scripting. I used to use sed, awk, python and even perl for this purpose, depending on the complexity of the task. Now I use javascript.

Replicating sed with Javascript and node.js – a Javascript shell script template

For the purpose of illustration let’s replicate the most basic and useful functionality of sed.

Here is our test text:


 The two most important days in your life are the day you are born and the day you find out why.
― Mark Twain

The javascript example below does the same as the following sed statement.

sed 's/day/night/g' someTextFile.txt

It replaces every instance of “day” with “night”. Here is an example of what it does:

$ echo "The two most important days in your life are the day you are born and the day you find out why." > MarkTwianQuote.txt
$ sed 's/day/night/g' MarkTwianQuote.txt
The two most important nights in your life are the night you are born and the night you find out why.

Basic javascript shell script for replacing strings (requires node.js):

#! /usr/local/bin/node
// The shebang line above tells the system to run this script with node.

var fs = require('fs');  // For reading files.

var args = process.argv.slice(2);  // Get the command line arguments.

var theFileToProcess = args[0];
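// Parse a /pattern/flags argument into a RegExp without using eval.
var regexpParts = /^\/(.*)\/([a-z]*)$/.exec(args[1]);
var theRegexp = new RegExp(regexpParts[1], regexpParts[2]);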
var theReplacementString = args[2];

fs.readFile(theFileToProcess, "utf8", function (err, data) {
    if (err) {
        console.error(err);
        process.exit(1);
    }
 
    // Split file on linebreaks.
    var lineArr = data.trim().split("\n");

    // Loop over every line.
    lineArr.forEach(function (line) {

        // Replace all matches of the regexp; lines without a match are printed unchanged.
        console.log(line.replace(theRegexp, theReplacementString));
    });
});

Now if you save the above script as replace.js, and run this on the command line you get:

$ echo "The two most important days in your life are the day you are born and the day you find out why." > MarkTwianQuote.txt
$ ./replace.js MarkTwianQuote.txt /day/g night
The two most important nights in your life are the night you are born and the night you find out why.
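
To get even closer to sed, the script could accept a sed-style substitution expression directly. The following is a minimal sketch, not a full sed parser: sedSubstitute is a hypothetical helper name, it assumes "/" is the delimiter and never appears inside the pattern or the replacement, and it only understands flags that javascript regular expressions also accept (such as g and i). Note that sed and javascript regular expression syntax differ in places, for example sed uses & and \1 in the replacement where javascript uses $& and $1.

```javascript
// Hypothetical helper: apply a sed-style "s/pattern/replacement/flags"
// expression to a string. Assumes "/" never appears inside the pattern
// or the replacement, so a plain split is enough.
function sedSubstitute(expression, text) {
  var parts = expression.split("/");
  if (parts.length !== 4 || parts[0] !== "s") {
    throw new Error("Expected an expression of the form s/pattern/replacement/flags");
  }
  var pattern = parts[1];
  var replacement = parts[2];
  var flags = parts[3];  // For example "g", "i" or "" (first match only).

  return text.replace(new RegExp(pattern, flags), replacement);
}

// Example: same substitution as sed 's/day/night/g'.
console.log(sedSubstitute("s/day/night/g", "the day you are born and the day you find out why."));
```

Without the g flag only the first match on the string is replaced, which matches the behavior of sed on each line.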

Processing large files

The above example takes the whole file into memory. If you have large files you will run out of memory with the above method.

To process a file that does not fit in memory, you can use the module line-reader which allows you to read a file line by line.

#! /usr/local/bin/node

// For reading large files line by line. Install with npm install line-reader.
var lineReader = require('line-reader');

var args = process.argv.slice(2);  // Get the command line arguments.

var theFileToProcess = args[0];
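// Parse a /pattern/flags argument into a RegExp without using eval.
var regexpParts = /^\/(.*)\/([a-z]*)$/.exec(args[1]);
var theRegexp = new RegExp(regexpParts[1], regexpParts[2]);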
var theReplacementString = args[2];

lineReader.eachLine(theFileToProcess, function(line, last) {

  // Replace all matches of the regexp; lines without a match are printed unchanged.
  console.log(line.replace(theRegexp, theReplacementString));
});
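
If you would rather not install a module, the same line by line approach can be sketched with the readline and fs modules that ship with node.js. This is an alternative to line-reader, not how line-reader works internally. The name replaceInFile and its callback arguments are hypothetical and chosen for illustration only.

```javascript
var fs = require("fs");
var readline = require("readline");

// Hypothetical helper (not part of line-reader): streams fileName line by line,
// passes each line with the replacement applied to onLine, and calls the
// optional onDone callback once the end of the file has been reached.
function replaceInFile(fileName, regexp, replacement, onLine, onDone) {
  var input = fs.createReadStream(fileName, { encoding: "utf8" });
  input.on("error", function (err) {
    console.error(err);
    process.exit(1);
  });

  var reader = readline.createInterface({
    input: input,
    crlfDelay: Infinity  // Treat "\r\n" as a single line break.
  });
  reader.on("line", function (line) {
    onLine(line.replace(regexp, replacement));
  });
  reader.on("close", function () {
    if (onDone) {
      onDone();
    }
  });
}

// Example use inside a script like the one above:
// replaceInFile(theFileToProcess, theRegexp, theReplacementString, console.log);
```

Only one chunk of the file is held in memory at a time, so the size of the file no longer matters.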

Real time PCM output in Javascript and Web Audio API

Ripples on Ice Lake.


Recently I upgraded a little piece of code that does real time synthesis of chimes to use the Web Audio API. Originally it was written with the simple Mozilla Audio Data API, which has been deprecated in favor of the far more capable but complex Web Audio API.

This is a quick post on real time PCM output with Web Audio API.

PCM sample output with Audio Data API (deprecated)

Back in 2010, Mozilla came out with the first native PCM output in the browser that didn’t require Flash.

The method was beautifully simple and straightforward. All you needed was a source of PCM audio samples:

function myPCMGenerationFunction(soundData) {
// Overwrite the soundData array with your PCM samples.
}

Then you needed to set up the destination.

var pcm_audio_destination = new AudioDataDestination(sampleRate, myPCMGenerationFunction);

Finally, you needed to start the sound.

// Start consuming PCM samples and sending them to the speaker.
pcm_audio_destination.start();

// Eventually, stop playing PCM samples.
pcm_audio_destination.stop();

Easy peasy.

PCM sample output with Web Audio API

Later, Google contributed an API called Web Audio API. It is much more capable and much more complex. It has the kitchen sink thrown in, but all we need here is raw PCM output.

For a long time I could not find documentation on how to accomplish this simple thing with the new API. Recently however I came across this terrific post that uses this method. Turns out it is not too bad.

Here is an example of a noise generator:

function myPCMSource() {
  return Math.random() * 2 - 1;  // For example, generate noise samples.
}
}

The rest is boilerplate.

var audioContext;
try {
  window.AudioContext = window.AudioContext || window.webkitAudioContext;
  audioContext = new AudioContext();
} catch(e) {
  alert('Web Audio API is not supported in this browser');
}

var bufferSize = 4096;
var myPCMProcessingNode = audioContext.createScriptProcessor(bufferSize, 1, 1);
myPCMProcessingNode.onaudioprocess = function(e) {
  var output = e.outputBuffer.getChannelData(0);
  for (var i = 0; i < bufferSize; i++) {
     // Generate and copy over PCM samples.
     output[i] = myPCMSource(); 
  }
}
myPCMProcessingNode.connect(audioContext.destination);
// A ScriptProcessorNode has no start() method; it begins processing once it is connected.
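
Any function that returns one sample per call can stand in for the noise source. As a sketch, here is a sine tone generator that keeps track of its phase between calls. The names makeSineSource and fillBuffer are hypothetical, and the frequency and amplitude values are arbitrary examples.

```javascript
// Hypothetical helper: returns a function that yields one sine wave sample
// per call, so it can be used in place of myPCMSource.
function makeSineSource(frequencyHz, sampleRate, amplitude) {
  var phase = 0;
  var phaseStep = 2 * Math.PI * frequencyHz / sampleRate;
  return function () {
    var sample = amplitude * Math.sin(phase);
    phase += phaseStep;
    if (phase >= 2 * Math.PI) {
      phase -= 2 * Math.PI;  // Keep the phase small to avoid losing precision.
    }
    return sample;
  };
}

// Fills a sample buffer the same way the onaudioprocess loop does.
function fillBuffer(output, source) {
  for (var i = 0; i < output.length; i++) {
    output[i] = source();
  }
  return output;
}

// In the browser, something like:
// var myPCMSource = makeSineSource(440, audioContext.sampleRate, 0.2);
```

Because the phase is carried over from one call to the next, the tone stays continuous across buffer boundaries.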

Here is a jsfiddle to show this in action. Warning, turn down your audio!

 

Real time processing of microphone input

If you need to do real time processing of microphone input, followed by playback, then you need to hook up the microphone audio source.

 
function myPCMFilterFunction(inputSample) {
  var noiseSample = Math.random() * 2 - 1;
  return inputSample + noiseSample * 0.1;  // For example, add noise samples to input.
}

The rest is boilerplate to set up the microphone etc.

var bufferSize = 4096;

var audioContext;
try {
  window.AudioContext = window.AudioContext || window.webkitAudioContext;
  audioContext = new AudioContext();
} catch(e) {
  alert('Web Audio API is not supported in this browser');
}

// Check if there is microphone input.
navigator.getUserMedia = navigator.getUserMedia ||
                         navigator.webkitGetUserMedia ||
                         navigator.mozGetUserMedia ||
                         navigator.msGetUserMedia;
var hasMicrophoneInput = !!navigator.getUserMedia;
if (!hasMicrophoneInput) {
  alert("getUserMedia() is not supported in your browser");
}

// Create a PCM processing "node" for the filter graph (bufferSize is defined above).
var myPCMProcessingNode = audioContext.createScriptProcessor(bufferSize, 1, 1);
myPCMProcessingNode.onaudioprocess = function(e) {
  var input = e.inputBuffer.getChannelData(0);
  var output = e.outputBuffer.getChannelData(0);
  for (var i = 0; i < bufferSize; i++) {
     // Modify the input and send it to the output.
     output[i] = myPCMFilterFunction(input[i]);
  }
}

var errorCallback = function(e) {
  alert("Error in getUserMedia: " + e);
};  

// Get access to the microphone and start pumping data through the graph.
navigator.getUserMedia({audio: true}, function(stream) {
  // microphone -> myPCMProcessingNode -> destination.
  var microphone = audioContext.createMediaStreamSource(stream);
  microphone.connect(myPCMProcessingNode);
  myPCMProcessingNode.connect(audioContext.destination);
  // No start() call is needed: audio flows once the nodes are connected.
}, errorCallback);
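
The prefixed navigator.getUserMedia calls above have since been deprecated in favor of the promise based navigator.mediaDevices.getUserMedia. Below is a minimal sketch of the same wiring with that API. The function names connectMicrophone and startMicrophone are hypothetical, and the code assumes audioContext and myPCMProcessingNode have been created as shown above.

```javascript
// Hypothetical helper: wires microphone -> processingNode -> destination
// and returns the microphone source node.
function connectMicrophone(audioContext, stream, processingNode) {
  var microphone = audioContext.createMediaStreamSource(stream);
  microphone.connect(processingNode);
  processingNode.connect(audioContext.destination);
  return microphone;
}

// Hypothetical helper: asks for the microphone with the promise based API.
// This part only runs in a browser, and the user has to grant permission.
function startMicrophone(audioContext, processingNode) {
  return navigator.mediaDevices.getUserMedia({ audio: true })
    .then(function (stream) {
      return connectMicrophone(audioContext, stream, processingNode);
    });
}

// In the browser, something like:
// startMicrophone(audioContext, myPCMProcessingNode).catch(function (e) {
//   alert("Error in getUserMedia: " + e);
// });
```

Keeping the wiring in its own function makes it possible to check the graph with stand-in objects outside the browser.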

You can try this out in a jsfiddle here.

Screeching audio feedback and ripples on a lake

Ripples build up on a lake in a steady breeze.


Audio feedback is a big issue when using the above real time microphone processing code on a laptop. Audio feedback is also a big issue in PA systems and hearing aids. Echo cancellation can be used to take care of this.

However, mixing noise with the microphone audio reminded me of a phenomenon that I noticed long ago.

Have you ever observed the build up of small waves on a pond in a steady breeze? Have you noticed that if it starts raining the waves will disappear?

In this analogy, the audio feedback loop corresponds to the wind exciting the waves on the lake, and the added noise corresponds to the rain. I wondered if the same would happen to the screeching audio feedback.

Here is another jsfiddle to try this out.

It’s a subtle effect, but if you turn down the noise, any sound will trigger feedback that quickly builds. If you turn up the noise, the system becomes less sensitive to the buildup of feedback, and the feedback can even disappear!


Candy Crush Monetization and Virality

What can app developers learn from Candy Crush?


 I spent a Saturday afternoon “studying” Candy Crush and taking heaps of screenshots and found the monetization and social features to be fascinating.

The game Candy Crush is a lot of fun and really addictive. It’s so addictive that I wouldn’t recommend trying it if you have any other commitments that day! People that get addicted have been known to waste substantial amounts of money on it.

The interesting thing about Candy Crush from an app developer’s perspective is its use of social media and its monetization strategy: WAIT, SOCIALIZE or PAY, or WaSP for short.

Virality

CC is very similar to Bejeweled, which is one of the most successful games of all time. CC goes a few steps beyond Bejeweled in making the game addictive and does an admirable job of adding the social component to the game. As a side effect the game is more fun and supremely viral. CC includes a number of social interactions and cues:

  • Your friends’ positions in the game are shown on the map of Candy Land [screenshot].
  • Your friends’ high scores and pictures appear at the bottom of the screen [screenshot].
  • You can invite friends to join [screenshot].
  • You can ask your Facebook friends for lives! [screenshot]
  • You can give lives to your friends (in theory: the functionality was broken) [screenshot] [screenshot]
  • Not only does it give you a score, but CC identifies which friend’s score you have beaten [screenshot] and gives you the option to brag about it to them [screenshot].

The essence of these interactions is:

  • Camaraderie
  • Invitation/Inclusion
  • Receiving
  • Giving
  • Friendly Competition

The competitive aspects of Candy Crush are not applicable to non-game apps. However, if an app has in-app purchases, Facebook invitations and giving and receiving of items via Facebook should be directly applicable.

We at Rubber Duck Labs make an app that allows users to purchase sound packs through in-app-purchases. Giving users the option of sharing these purchases with friends could be a really fun idea.

Monetization

Candy Crush relies on in-app-purchases of a few items for monetization:

  • More lives.
  • A few more moves to complete a level.
  • Items to help complete a level such as lollipops and fish that eat candy.

WaSP – Wait, Socialize or Pay:


CC uses a bizarre and fascinating ploy to get a user to pay for more lives. Once the user has used up all lives, the user has three options:

  • Wait for an increasing amount of time until they can play again.
  • Ask for lives via Facebook.
  • Pay!

The timeout seems like a risky strategy since it risks losing customers. On the other hand, it might prevent users from overdosing on the game and giving it up.

It would be fascinating to see the statistics for how people behave when confronted with this choice. The main questions on my mind are:

  • How many people ask friends for lives?
  • How many people pay for more lives?
  • How many people never come back?

What did you do when you ran out of lives?

Virality Boost: asking for a life:

When a user runs out of lives, she can wait, pay or ask a friend for a life. The process that that friend has to go through is:

  1. The friend gets a Facebook request to download the app [screenshot].
  2. The friend then has to play one level.
  3. Only then does the friend get the request for the life [screenshot].
  4. The friend then needs to send a life via Facebook.

In my case, the life was not delivered. I assume this is due to a bug in their code.

Warning: SPAM!


Deceptive dialog.

When the user asks a friend for a life via Facebook, CC shows a dialog asking for permission to post to friends [screenshot]. When I saw this, I thought CC needed this permission to ask a friend for a new life on my behalf.

However, CC used this permission to spam all my friends with a post saying that I was using CC!! The post was concealed from me, as it did not appear on my own Facebook timeline or in my feed. I only found out when a friend liked the post. This was embarrassing.

By doing this, CC betrayed my trust and I promptly deleted Candy Crush from Facebook and my iPhone. However, the post was still there on my friend’s feed [screenshot].

Takeaways:

We at Rubber Duck Labs make video creation apps where the user can play sound effects to catch a child’s attention. We allow users to purchase sound packs through in-app-purchases.

Wait: Since we make video creation apps, the idea of making our users wait to record a video would not fly! However, we have already incorporated the Wait strategy into our app. Users can play a set number of previews of the sound effect items. When the previews are used up, the user can pay to unlock all the sounds, or wait until the next day to continue. Users can continue to use the basic set of sounds and use all other features such as the “PVR for real life” feature, adding badges as well as composing and sharing videos.

Socialize: In our case, sharing of videos via Facebook gives a bit of virality. However, we could learn from Candy Crush here. Giving users the option of giving a friend a set of sound effects is a fun idea that our users would enjoy.

A final word: We take the privacy of our users very seriously. We would not betray our users’ trust by spamming about their activity on Facebook. However, in other respects, we can learn a lot from Candy Crush.

Screen shots:

Here is a collection of screenshots depicting the various aspects of the WaSP strategy. The images were anonymized using Facepixelizer.

 

[Screenshots: Candy Land map; friends at bottom of screen; request after install; high score beaten; invite friends; message center; send lives; invite; write on wall after beating a high score; ask friends for a life; Facebook privacy settings; timeline spam; stuck.]

 

 


Scientific American Magazine features our work on Super Human Speech Separation

The April 2011 issue of Scientific American Magazine features our work on “Super Human Speech Separation”.

Scientific American: Solving the Cocktail Party Problem


Congrats to the team!!


Super-Human Multi-Talker Speech Recognition: The IBM 2006 Speech Separation Challenge System

T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath

Interspeech 2006, Winner of PASCAL Speech Separation Challenge

Abstract:

We describe a system for model based speech separation which achieves super-human recognition performance when two talkers speak at similar levels. The system can separate the speech of two speakers from a single channel recording with remarkable results. It incorporates a novel method for performing two-talker speaker identification and gain estimation. We extend the method of model based high resolution signal reconstruction to incorporate temporal dynamics. We report on two methods for introducing dynamics; the first uses dynamics in the acoustic model space, the second incorporates dynamics based on sentence grammar. The addition of temporal constraints leads to dramatic improvements in the separation performance. Once the signals have been separated they are then recognized using speaker dependent labeling.

Super Human Speech Separation Paper PDF