Media capture functionality in Microsoft Edge
The media capture functionality in Microsoft Edge is implemented based on the W3C Media Capture and Streams specification that recently reached the Working Group “Last Call” status. We are very excited that major websites such as Facebook share the same vision and have adopted standards-based interfaces to enable the best user experiences across browsers.
In this post, we will share insights into some of our implementation decisions and detail what we have implemented today and what we are still working on for a future release. We will also suggest some best practices when using the media capture APIs.
A brief summary of the Media Capture and Streams APIs
The getUserMedia() method is a good starting point for understanding the Media Capture APIs. The getUserMedia() call takes MediaStreamConstraints as an input argument, which defines the preferences and/or requirements for capture devices and captured media streams, such as camera facingMode, video resolution, and microphone volume. Through MediaStreamConstraints, you can also select a specific capture device using its deviceId, which can be obtained from the enumerateDevices() method. Once the user grants permission, the getUserMedia() call returns a promise that resolves with a MediaStream object if the specified MediaStreamConstraints can be met.
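For example, here is a minimal sketch of listing the available capture devices with enumerateDevices() and then requesting a specific webcam by its deviceId (picking the first webcam on the list is purely illustrative):

navigator.mediaDevices.enumerateDevices().then(function (devices) {
    // Log every capture device; labels may be empty until the user has
    // granted capture permission to the page.
    devices.forEach(function (device) {
        console.log(device.kind + ": " + device.label + " (id: " + device.deviceId + ")");
    });
    // Pick the first webcam on the list and request it explicitly by deviceId.
    var webcams = devices.filter(function (d) { return d.kind === "videoinput"; });
    return navigator.mediaDevices.getUserMedia({
        video: { deviceId: webcams[0].deviceId }
    });
}).then(function (stream) {
    // Use the stream, e.g. set it as the srcObject of a video tag.
}).catch(function (error) {
    console.log(error.name + ": " + error.message);
});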
The MediaStream object will have one or both of the following: one MediaStreamTrack for the captured video stream from a webcam, and one MediaStreamTrack for the captured audio stream from a microphone. The MediaStream object can be rendered on multiple rendering targets, for example, by setting it on the srcObject attribute of MediaElement (e.g. video or audio tags), or the source node of a Web Audio graph. The MediaStreamTracks can also be used by the ORTC API (which we are in the process of implementing) to enable real-time communications.
User permissions
While media capture functionality can enable a lot of exciting user and business scenarios, it also introduces security and privacy concerns. Therefore, user consent on streaming audio and video from the capture devices is a critical part of the feature. The W3C spec recommends some best practices while also leaving some flexibility to each browser's implementation. To balance security and privacy concerns with the user experience, the Microsoft Edge implementation does the following:
- If the webpage is from an HTTP origin, the user will be prompted for permission when a capture device is accessed through the getUserMedia() call. We will allow the permission to persist for the specific capture device type until all capture devices of that type are released by the webpage.
- For webpages from an HTTPS origin, when a user grants permission for a webpage to access a capture device, the permission will persist for the specific capture device type. If the user navigates away to another page, all permissions will be dismissed. Microsoft Edge does not store any permanent permissions for a page or domain.
- When a webpage calls getUserMedia() from an iframe, we manage the capture device permission separately, based on the iframe's own URL. This protects the user in cases where the iframe is from a different domain than its parent webpage.
Once a user grants permission for a webpage to access a media capture device, it is important to help the user to track which browser tab is actively using the capture device, especially when the user has navigated to a different tab. Microsoft Edge will use a “recording” badge in the tab title to indicate tabs streaming audio and/or video data from the capture devices. Note that this feature is not implemented in the current release.
Capture device selection and settings
The getUserMedia() interface allows a lot of flexibility in capture device selection and settings through MediaStreamConstraints. The W3C spec has very detailed descriptions on the Constrainable Pattern and corresponding decision process. We’d like to share more of our implementation details, especially regarding default expectations.
The following table summarizes the default settings we apply internally for some of the constraints.
| Constraints | Default values * |
| --- | --- |
| width | 640 |
| height | 360 |
| aspectRatio | 1.7777777778 (16:9) |
| frameRate | 30 |
| volume | 1.0 |
| sampleRate | device default |
| sampleSize | device default (16 or 32-bit) |
When setting the constraints, please keep in mind that capture devices tend to have a wide range of capabilities. Unless your target scenario has a must-have requirement, you should allow as much flexibility as possible and let the browser make device selection and setting decisions for you. Our capture pipeline is currently limited to the device default audio sample size and sample rate and doesn't currently support setting a different sampleSize or sampleRate. Additionally, our capture pipeline currently relies on the global setting in the Windows audio device manager to determine the audio sampleRate of specific microphone devices.
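As an illustration, here is a minimal sketch of flexible constraints that use the spec's "ideal" values to express preferences rather than hard requirements (the specific numbers are illustrative only):

navigator.mediaDevices.getUserMedia({
    audio: true,
    video: {
        // Preferences, not requirements: the browser can still select a
        // device and settings when no exact match is available.
        width: { ideal: 640 },
        height: { ideal: 360 },
        frameRate: { ideal: 30 }
    }
}).then(function (stream) {
    // The browser has picked the closest matching devices and settings.
}).catch(function (error) {
    console.log(error.name + ": " + error.message);
});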
If you plan to use the media capture streams jointly with ORTC in real-time communications, we suggest not setting the “volume” constraint. The Automatic Gain Control logic in the ORTC component will be invoked to handle the volume levels dynamically. The volume level can also be adjusted by Windows users through the audio device manager tool.
We don’t currently have a default or preferred facingMode for webcams. Instead, we encourage you to set the facingMode for your specific scenario. When it is not specified, we will try to pair up the webcam with the microphone in the device selection logic.
There is a known issue in the video capture pipeline which doesn’t allow setting webcam resolution in this preview release. We are working on a fix which we expect should be available in the next Insider build.
If the deviceId or groupId is not explicitly set in the MediaStreamConstraints, we will go through the following logic to select the capture devices. Here, let us assume we want to select one microphone and one webcam:
- If there is one set of capture devices that satisfy the MediaStreamConstraints with the best match, we will choose those devices.
- Otherwise, if multiple microphones and webcams match the MediaStreamConstraints equally well:
  - We first pick the system default microphone device for communications if it is on the list. We then pick the webcam that pairs with that microphone if there is one, or pick the first webcam on the webcam list.
  - If the system default microphone is not defined, we will enumerate through the capture devices to pair up a microphone and webcam based on their groupId. The first pair we find will be the one we select.
  - If the above fails, we will pick the first microphone and first webcam from the candidate device list.
A headset, when plugged in, provides the default microphone device and default audio rendering option for communications scenarios. Advanced Windows users can change their audio device settings manually for specific purposes through the audio devices manager.
Updates to Media Elements
We have updated our Media Elements (audio/video tags) implementation to enable using them as rendering targets for MediaStream objects. The W3C spec has a table with a very good summary of the changes to the Media Elements. As part of our implementation decision, we now internally handle all real-time media streams, whether from a local capture device or from a remote ORTC receiver object, using the Windows Media Foundation low-latency playback mode (i.e. the real-time mode). For video capture using built-in webcams, we also handle device rotation internally by setting the right property on video samples so the video tag can render video frames in the correct orientation.
In some other implementations of the feature, “srcObject” is not supported. Developers would need to convert a MediaStream object using the URL.createObjectURL() method and then set it on the “src” attribute of the Media Element. We do not currently support that legacy behavior, and instead follow the latest W3C spec. Both Chrome and Firefox currently have active tickets to track “srcObject” support.
Promises vs. callback patterns
Based on the W3C spec, we support both the promise-based getUserMedia() method and the callback-based getUserMedia() method. The callback-based method allows an easier transition if you have a webpage using the interface already (although it might be a vendor-prefixed version). We encourage web developers to use the promise-based approach to follow the industry trend for new interface design styles on the web.
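For comparison, here is a minimal sketch of the two patterns; the callback-based method hangs off navigator rather than navigator.mediaDevices:

// Promise-based pattern.
navigator.mediaDevices.getUserMedia({ audio: true, video: true })
    .then(function (stream) {
        // Use the stream.
    })
    .catch(function (error) {
        console.log(error.name + ": " + error.message);
    });

// Callback-based pattern, kept for easier transition from existing pages.
navigator.getUserMedia({ audio: true, video: true },
    function (stream) {
        // Use the stream.
    },
    function (error) {
        console.log(error.name + ": " + error.message);
    });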
Missing features
Our implementation does not currently support getting video resolutions not natively supported by the webcam. This is largely due to a lack of a video DSP module in our media capture pipeline. We currently don’t have plans to address this in the near term.
We also currently don’t support echoCancellation in our MediaTrackConstraintSet. This is a limitation in our current media capture pipeline. We plan to support echo cancellation in our ORTC media stack for real-time communications in a future update.
Sample scenarios using media capture
Media capture is an essential step in many scenarios, including real-time audio and video communications, snapping a photo or capturing a barcode, or recording a voice message. Below we walk through a couple of simple scenarios that introduce how to use the media capture functionality.
Scenario #1: Capture photo from webcam
First, get a video stream from a webcam and put it in a video tag for preview. Let’s assume we have a video tag on the page and it is set to autoplay.
navigator.mediaDevices.getUserMedia({
    video: {
        facingMode: "user"
    }
}).then(function (stream) {
    var video = document.getElementById('videoTag');
    video.srcObject = stream;
}).catch(function (error) {
    console.log(error.name + ": " + error.message);
});
Here is one example that accounts for the legacy src approach until all browsers support the standards-based approach:
var video = document.getElementById('videoTag');
if (typeof(video.srcObject) != "undefined") {
    video.srcObject = stream;
}
else {
    video.src = URL.createObjectURL(stream);
}
Next, copy a video frame onto a canvas. Let’s assume we have set up the event listener so when you tap the video tag, we will invoke the following function:
function capturePhoto() {
    var video = document.getElementById('videoTag');
    var canvas = document.getElementById('canvasTag');
    var videoWidth = video.videoWidth;
    var videoHeight = video.videoHeight;
    if (canvas.width !== videoWidth || canvas.height !== videoHeight) {
        canvas.width = videoWidth;
        canvas.height = videoHeight;
    }
    var ctx = canvas.getContext('2d');
    ctx.drawImage(video, 0, 0, videoWidth, videoHeight);
}
Finally, save the picture:
function savePhoto() {
    var canvas = document.getElementById('canvasTag');
    var imgData = canvas.msToBlob("image/jpeg");
    navigator.msSaveBlob(imgData, "myPhoto.jpg");
}
You can also change the code here to upload the data blob to your web server.
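For example, here is a minimal sketch of uploading the blob with XMLHttpRequest; the '/upload' endpoint and the 'photo' form field name are placeholders for whatever your server expects:

function uploadPhoto() {
    var canvas = document.getElementById('canvasTag');
    var imgData = canvas.msToBlob("image/jpeg");
    // Package the blob as multipart form data and post it.
    var formData = new FormData();
    formData.append("photo", imgData, "myPhoto.jpg");
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "/upload"); // placeholder endpoint
    xhr.onload = function () {
        console.log("Upload finished with status " + xhr.status);
    };
    xhr.send(formData);
}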
Don’t forget to release the webcam device after you complete the task. Some earlier browser implementations have a stop() method on the MediaStream object, but that is not part of the W3C spec. Instead, you should call the stop() method on the MediaStreamTrack object to release the capture device. For example:
// mediaStream is the stream obtained from getUserMedia() above.
var video = document.getElementById('videoTag');
var videoTracks = mediaStream.getVideoTracks();
video.srcObject = null;
videoTracks[0].stop();
Once you know how to use one webcam, it should not be difficult to introduce a camera switching feature to your page to handle multiple webcams. You can check out our demo at the Microsoft Edge Dev site for more.
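A minimal sketch of such a camera switching feature might look like the following; it reuses the videoTag element from above, and everything else is illustrative:

var currentIndex = 0;

function switchCamera() {
    navigator.mediaDevices.enumerateDevices().then(function (devices) {
        // Keep only the webcams.
        var webcams = devices.filter(function (d) { return d.kind === "videoinput"; });
        if (webcams.length < 2) { return; }
        currentIndex = (currentIndex + 1) % webcams.length;
        var video = document.getElementById('videoTag');
        // Release the current webcam before opening the next one.
        if (video.srcObject) {
            video.srcObject.getVideoTracks().forEach(function (t) { t.stop(); });
            video.srcObject = null;
        }
        return navigator.mediaDevices.getUserMedia({
            video: { deviceId: webcams[currentIndex].deviceId }
        }).then(function (stream) {
            video.srcObject = stream;
        });
    }).catch(function (error) {
        console.log(error.name + ": " + error.message);
    });
}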
Scenario #2: Capture a voice message from microphone
Now let’s look at a simple example using the microphone.
First, get an audio stream from a microphone and set it as the source node of a web audio graph.
var audioContext = new AudioContext();
navigator.mediaDevices.getUserMedia({
    audio: true
}).then(function (stream) {
    var sourceNode = audioContext.createMediaStreamSource(stream);
    var gainNode = audioContext.createGain();
    sourceNode.connect(gainNode);
    …
}).catch(function (error) {
    console.log(error.name + ": " + error.message);
});
Next, extract the audio data from a web audio ScriptProcessorNode. This is too lengthy to include here, but you can check out the actual demo code at the Microsoft Edge Dev site and GitHub repository, where you can also add some simple web audio DSP filters before the ScriptProcessorNode.
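The core of that step is a ScriptProcessorNode whose onaudioprocess callback copies the incoming samples. A simplified sketch, continuing from the gainNode above, might look like this (the buffer size and channel count are illustrative):

var recordedBuffers = [];
var processorNode = audioContext.createScriptProcessor(4096, 1, 1);

processorNode.onaudioprocess = function (event) {
    // Copy the input samples, since the underlying buffer is reused.
    var input = event.inputBuffer.getChannelData(0);
    recordedBuffers.push(new Float32Array(input));
};

gainNode.connect(processorNode);
// The node needs to be connected to a destination for onaudioprocess to fire.
processorNode.connect(audioContext.destination);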
Finally, save the audio data into a wav file. Once we have the audio data, we can add a wav file header and save the data blob. Again, please check out the actual demo code at the Microsoft Edge Dev site.
We’ve only talked about a couple of simple examples as starting points. Once you’ve gotten the captured video stream into a video tag and then a canvas, and gotten captured audio into web audio, it’s easy to see many scenarios that light up with just a bit more work. We’re eager for your feedback so we can further improve our implementation, and meanwhile we are looking forward to seeing what you do with these new tools!
– Shijun Sun, Senior Program Manager, Microsoft Edge