One of the many new features coming in the next release of Windows Phone (a.k.a. Mango) is Music search, a built-in song recognition feature jointly developed with researchers on the Bing team. We haven’t said much about it yet, so recently I sat down with several members of the Augmented Reality team (how cool is that name?) in Windows Phone Engineering to find out what it does and how it works.
We met in the second-floor office of Elliot Kirk, who was responsible for testing the new feature. Kirk’s office is stuffed with the tools of his trade: stereo equipment, a handheld decibel meter, and a gilded, football-sized cockroach impaled with a nail (his job is killing software “bugs” after all). Joining us were program manager Steve Cosman and Houston Wong, lead programmer on the feature. If you like what they had to say, make sure to check out the new Channel 9 interview with Steve about Music search.
Q: So, guys, first tell me: What can I do with Music search?
Elliot: The basic scenario is that you’re listening to some music that you’ve never heard—or you hear a song you like but don’t remember the name of it. In Mango, you can just pull out your phone and within seconds get the name of the song or artist and also a link to the Zune music store so you can download or buy it.
Steve: Anything you can buy in Zune Marketplace you can find with Music search.
Q: Some apps in Marketplace can already do this—the identifying part, at least. How is Music search different?
Steve: Most other apps listen to a song for a fixed amount of time, and then analyze and try to match it. One of the things we do differently is we’re continuously listening and analyzing. As soon as we know what the song is, we return the result to you.
Houston: What this means is that, in the extreme case, you might actually get near-instant results.
Q: That’s cool. How does Music search work?
Steve: We’re using the microphone to record and then doing something called “fingerprinting,” where we look for unique acoustic features of the music. We listen for about 3 seconds, create a fingerprint, and then we send that fingerprint to Bing, which looks for a match in the Zune music catalog.
Elliot: If it doesn’t find one, we send another 3-second slice until we get a result.
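The loop Steve and Elliot describe can be sketched roughly as follows. This is only an illustration, not the team's implementation: the `fingerprint` and `lookup` callables, the `identify` function, and the `max_slices` cutoff are hypothetical stand-ins, since the interview does not describe the real fingerprinting algorithm or the Bing service interface.

```python
from typing import Callable, Iterable, Optional, Sequence

SLICE_SECONDS = 3  # slice length mentioned in the interview


def identify(
    slices: Iterable[Sequence[float]],
    fingerprint: Callable[[Sequence[float]], int],
    lookup: Callable[[int], Optional[str]],
    max_slices: int = 10,  # assumed cutoff; not stated in the interview
) -> Optional[str]:
    """Fingerprint one audio slice at a time and stop at the first match.

    Only the fingerprint is passed to `lookup`, never the raw audio.
    """
    for count, audio in enumerate(slices, start=1):
        if count > max_slices:
            return None  # give up after too many unmatched slices
        match = lookup(fingerprint(audio))
        if match is not None:
            return match  # return as soon as the song is known
    return None  # ran out of audio without a match
```

Because the function returns on the first successful lookup, a song that is recognized from its first slice comes back after roughly one slice of listening, which is the "near-instant" case Houston mentions.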
Members of the Augmented Reality team in Windows Phone: (from left) Houston Wong, Steve Cosman, and Elliot Kirk.
Q: You don’t transmit the actual audio?
Steve: No, we don’t—which means we’re using less of your data plan. And it’s generally quicker to use fingerprints since we’re matching against a very large data set of music: millions and millions of tracks.
Q: So someone has already scanned all the tunes in the Zune catalog and created a library of digital fingerprints for each song?
Steve: Yep, exactly. At that point it’s pretty much just a straight up search. We look at the fingerprint we’ve created on the phone, compare it to the millions of fingerprints generated from tracks in the Zune music catalog, and see what matches.
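To make the "straight up search" concrete, here is a toy sketch of matching a phone-side fingerprint against a precomputed catalog. It assumes, purely for illustration, that a fingerprint is a set of integer hashes and that the best match is the track sharing the most hashes with the query. The names `build_index`, `best_match`, and `min_votes` are hypothetical, and the real Bing algorithm is not described in the interview.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Set


def build_index(catalog: Dict[str, Iterable[int]]) -> Dict[int, Set[str]]:
    """Map each fingerprint hash to the set of tracks that contain it."""
    index: Dict[int, Set[str]] = defaultdict(set)
    for track, hashes in catalog.items():
        for h in hashes:
            index[h].add(track)
    return index


def best_match(
    index: Dict[int, Set[str]],
    query_hashes: Iterable[int],
    min_votes: int = 2,  # assumed threshold to reject weak matches
) -> Optional[str]:
    """Return the track sharing the most hashes with the query, if any."""
    votes: Counter = Counter()
    for h in set(query_hashes):
        for track in index.get(h, ()):
            votes[track] += 1
    if not votes:
        return None
    track, count = votes.most_common(1)[0]
    return track if count >= min_votes else None


# Toy catalog: two tracks with partially overlapping hashes.
index = build_index({"Track A": [1, 2, 3, 4], "Track B": [3, 4, 5, 6]})
print(best_match(index, [1, 2, 3]))
```

Building the index ahead of time is what keeps the lookup cheap: the query only touches the catalog entries that share at least one hash with it, rather than every track.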
Q: How was Bing involved?
Elliot: This started as a Bing research project. They developed the fingerprinting algorithm. The Bing team has been amazing. It’s been a great experience working with them.
Q: Are some kinds of songs more challenging to match than others—say, all those covers of “Louie Louie” or samples of original songs embedded in other songs, like you find in hip hop?
Steve: It’s pretty interesting how we pick the right track. We’re working on that still. One problem is when you get, for example, the German karaoke version of Britney Spears’ “Toxic” instead of the multi-platinum U.S. album version. There’s also the situation where the identical hit song is on 25 different albums. How do you figure out which one to return? That’s another problem we deal with.
Elliot: But the fingerprint is actually getting good enough that we can distinguish the album version of a song from the live version, as long as that album is in the database.
Steve: No one plays a song the exact same way twice. Even your ear can’t detect the differences we can.
Coming in Mango: Tap the music note icon in Bing to start a new Music search.
Q: Are there other situations Music search finds challenging?
Steve: It’s fine if there are voices talking over the music. But if you sing along with a song, you might screw up the detection process.
Elliot: When you sing over a song you actually alter the timing of it, so the fingerprint we create doesn’t quite match the original.
Q: Sounds like this feature must have really been a challenge to test.
Steve [gesturing to Elliot]: His test stories are the best. There was this point where Elliot for about a week was running around to everybody’s office to find who had the quietest office for our tests.
Elliot: We wanted to know what the lowest and highest sound levels we could detect were. I also came in late on a Saturday evening, hoping no one else was around, to find out how loud we could go. It turns out to be around 120 decibels, which is almost as loud as a jet engine. I had to upgrade from PC speakers to a full 110-watt receiver with surround sound. I had my earplugs in and my fingers over my ears, and I just cranked it. That was fun. You could hear it all the way across the building.
Q: What song did you use?
Elliot: I think it was Britney Spears’ “Toxic.” It was painful.
Steve: I wouldn’t have admitted that if I were you.
Q: Anything else?
Elliot: We went out and tested the last 10 years’ worth of tracks from Billboard, to make sure all the most popular tracks are detectable.
There were also a lot of constraints that we had to either model or go out and test in the real world, like background noise in the places where we think most people will use the feature. I spent a lot of time driving to and from work at different speeds with my windows open different amounts, just to make sure we could recognize the songs if the window was all the way open, we were doing 40 m.p.h., and the music was blaring.
Steve: You should never use your phone while driving.
Elliot: I also spent a lot of time in bars, but not necessarily just for testing.