December 18, 2013

Optimizing DirectX apps for low latency input and longer battery life

Windows Apps Team

DirectX is a powerful graphics platform used in all sorts of Windows apps, ranging from high budget triple-A games to mainstream apps like newsreaders and photo manipulation apps. However, having great graphics alone doesn’t make for a positive customer experience. There are additional factors to consider when developing apps with DirectX. This blog post covers two of them: input responsiveness and power efficiency.

Input responsiveness

Input latency is defined as the time it takes for the system to respond to user input. The response is often a change in what’s displayed on the screen, or what’s heard through audio feedback.

Every input event, whether it comes from a touch pointer, mouse pointer, or keyboard, generates a message to be processed by an event handler. Modern touch digitizers and gaming peripherals report input for these events at a minimum of 100 Hz per pointer, which means that apps can receive 100 events or more per second per pointer (or keystroke). This rate of updates is amplified if multiple pointers are active concurrently or a higher precision input device is used (for example, a gaming mouse). The event message queue can fill up very quickly.

It’s important to understand the input latency demands of your apps so that events are processed in a way that is best for the scenario. There is no one solution for all apps.

Power efficiency

For the purpose of this blog post, we use the term power efficiency to refer to how much the app uses the GPU. An app that uses the GPU less is more power efficient and has longer battery life. The same principle holds for the CPU, but the GPU is the focus of this blog post.

If an app can afford to not redraw the whole screen at 60 frames per second (currently, the maximum rendering speed on most displays) without degrading the user’s experience, it will be more power efficient by drawing less often. Some apps only need to update the screen in response to user input, so those apps should not draw the same content repeatedly at 60 frames per second.

Choosing what to optimize for

When designing a DirectX app, you need to make some choices. Does the app need to render 60 frames per second to present smooth animation or does it only need to render in response to input? Does it need to have the lowest possible input latency or can it tolerate a little bit of delay? Will your users expect your app to be judicious about battery usage, or not?

The answers to these questions will likely align your app with one of the following scenarios:

1. Render on demand.
Apps in this category only need to update the screen in response to specific types of input. Power efficiency is excellent because the app doesn’t render identical frames repeatedly, and input latency is low because the app spends most of its time waiting for input. Board games and news readers are examples of apps that might fall into this category.
2. Render on demand with transient animations.
This scenario is similar to the first scenario except that certain types of input will kick off an animation that isn’t dependent on subsequent input from the user. Power efficiency is good because it doesn’t render identical frames repeatedly, and input latency is low while the app is not animating. Interactive children’s games and board games that animate each move are examples of apps that might fall into this category.
3. Render 60 frames per second.
In this scenario, the app is constantly updating the screen. Power efficiency is poor because it renders the maximum number of frames the display can present. Input latency is high because DirectX blocks the thread while content is being presented. Doing so prevents the thread from sending more frames to the display than it can show to the user. First person shooters, real-time strategy games, and physics-based games are examples of apps that might fall into this category.
4. Render 60 frames per second and achieve the lowest possible input latency.
Similar to scenario 3, the app is constantly updating the screen so power efficiency will be the same. The difference is that the app responds to input on a separate thread so that input processing isn’t blocked by presenting graphics to the display. Online multiplayer games or rhythm timing games might fall into this category as well as fighting games like Street Fighter because they support move inputs within extremely tight event windows.

Implementation

Most DirectX apps are driven by what is known as the game loop. The basic algorithm is to do the following repeatedly until the user quits the game or app:

1. Process input.
2. Update the game state.
3. Draw the game content.

When the content of a DirectX app is rendered and ready to be presented to the screen, the game loop waits until the GPU is ready to receive a new frame before waking up to process input again.

We’ll show the implementation of the game loop for each of the scenarios mentioned earlier by iterating on a simple jigsaw puzzle app. The source code for this app can be found in the MSDN samples gallery here: http://code.msdn.microsoft.com/windowsapps/CoreDispatcher-event-47f41a34. The decision points, benefits, and tradeoffs discussed with each implementation can serve as a guide to help you optimize your apps for low latency input and power efficiency.

Figure 1: Screenshot of the jigsaw puzzle app that is referenced in this blog post

Scenario 1: Render on demand

The first iteration of the jigsaw puzzle app only updates the screen when a user moves a puzzle piece. A user can either drag a puzzle piece into place or snap it into place by selecting it and then touching the correct destination. In the second case, the puzzle piece will jump to the destination with no animation or effects.

The code has a single-threaded game loop within the IFrameworkView::Run method that uses CoreProcessEventsOption::ProcessOneAndAllPending. Using this option dispatches all currently available events in the queue. If no events are pending, the game loop waits until one appears.

void App::Run()
{
    // Notify the swap chain that this app intends to render each frame faster
    // than the display's vertical refresh rate (typically 60Hz). Apps that cannot
    // deliver frames this quickly should set this to 2.
    m_deviceResources->SetMaximumFrameLatency(1);

    while (!m_windowClosed)
    {
        // Wait for system events or input from the user.
        // ProcessOneAndAllPending will block the thread until events appear and are processed.
        CoreWindow::GetForCurrentThread()->Dispatcher->ProcessEvents(CoreProcessEventsOption::ProcessOneAndAllPending);

        // If any of the events processed resulted in a need to redraw the window contents, then we will re-render the
        // scene and present it to the display.
        if (m_updateWindow || m_state->StateChanged())
        {
            m_main->Render();
            m_deviceResources->Present();

            m_updateWindow = false;
            m_state->Validate();
        }
    }
}

Scenario 2: Render on demand with transient animations

In the second iteration, the app is modified so that when a user selects a puzzle piece and then touches the correct destination for that piece, it animates across the screen until it reaches its destination.

As before, the code has a single-threaded game loop that uses ProcessOneAndAllPending to dispatch input events in the queue. The difference now is that during an animation, the loop changes to use CoreProcessEventsOption::ProcessAllIfPresent so that it doesn’t wait for new input events. If no events are pending, ProcessEvents returns immediately and allows the app to present the next frame in the animation. When the animation is complete, the loop switches back to ProcessOneAndAllPending to again limit screen updates.

void App::Run()
{
    // Notify the swap chain that this app intends to render each frame faster
    // than the display's vertical refresh rate (typically 60Hz). Apps that cannot
    // deliver frames this quickly should set this to 2.
    m_deviceResources->SetMaximumFrameLatency(1);

    while (!m_windowClosed)
    {
        // 2. Switch to a continuous rendering loop during the animation.
        if (m_state->Animating())
        {
            // Process any system events or input from the user that is currently queued.
            // ProcessAllIfPresent will not block the thread to wait for events. This is the desired behavior when
            // trying to present a smooth animation to the user.
            CoreWindow::GetForCurrentThread()->Dispatcher->ProcessEvents(CoreProcessEventsOption::ProcessAllIfPresent);

            m_state->Update();
            m_main->Render();
            m_deviceResources->Present();
        }
        else
        {
            // Wait for system events or input from the user.
            // ProcessOneAndAllPending will block the thread until events appear and are processed.
            CoreWindow::GetForCurrentThread()->Dispatcher->ProcessEvents(CoreProcessEventsOption::ProcessOneAndAllPending);

            // If any of the events processed resulted in a need to redraw the window contents, then we will re-render the
            // scene and present it to the display.
            if (m_updateWindow || m_state->StateChanged())
            {
                m_main->Render();
                m_deviceResources->Present();

                m_updateWindow = false;
                m_state->Validate();
            }
        }
    }
}

To support the transition between ProcessOneAndAllPending and ProcessAllIfPresent, the app must track state to know if it’s animating or not. In the jigsaw puzzle app, you do this by adding a new method on the GameState class that can be called during the game loop. The animation branch of the game loop drives updates in the state of the animation by calling GameState’s new Update method.

Scenario 3: Render 60 frames per second

In the third iteration, the app displays a timer that shows the user how long they’ve been working on the puzzle. Because it displays the elapsed time up to the millisecond, it must render 60 frames per second to keep the display up to date.

As in scenarios 1 and 2, the third iteration of the app has a single-threaded game loop. The difference with this scenario is that because it’s always rendering, it no longer needs to track changes in the game state as was done in the first two scenarios. As a result, it can default to use ProcessAllIfPresent for processing events. If no events are pending, ProcessEvents returns immediately and proceeds to render the next frame.

void App::Run()
{
    // Notify the swap chain that this app intends to render each frame faster
    // than the display's vertical refresh rate (typically 60Hz). Apps that cannot
    // deliver frames this quickly should set this to 2.
    m_deviceResources->SetMaximumFrameLatency(1);

    while (!m_windowClosed)
    {
        if (m_windowVisible)
        {
            // 3. Continuously render frames and process system events and input as they appear in the queue.
            // ProcessAllIfPresent will not block the thread to wait for events. This is the desired behavior when
            // trying to present smooth animations to the user.
            CoreWindow::GetForCurrentThread()->Dispatcher->ProcessEvents(CoreProcessEventsOption::ProcessAllIfPresent);

            m_state->Update();
            m_main->Render();
            m_deviceResources->Present();
        }
        else
        {
            // 3. If the window isn't visible, there is no need to continuously render.
            // Process events as they appear until the window becomes visible again.
            CoreWindow::GetForCurrentThread()->Dispatcher->ProcessEvents(CoreProcessEventsOption::ProcessOneAndAllPending);
        }
    }
}

This approach is the easiest way to write an app because there’s no need to track additional state to determine when to render. It achieves the fastest rendering possible along with reasonable input responsiveness on a timer interval.

However, this ease of development comes at a price. Rendering at 60 frames per second uses more power than rendering on demand, so it’s best to use ProcessAllIfPresent only when the app changes what is displayed every frame. This approach also increases input latency by as much as 16.7ms because the app now blocks the game loop on the display’s sync interval instead of on ProcessEvents. Some input events might be dropped because the queue is only processed one time per frame (60 Hz).

Scenario 4: Render 60 frames per second and achieve the lowest possible input latency

Some apps may be able to ignore or compensate for the increase in input latency seen in scenario 3. But if low input latency is critical to the app’s functionality, apps that render 60 frames per second need to process input on a separate thread.

The fourth iteration of the jigsaw puzzle app builds on scenario 3 by splitting the input processing and graphics rendering from the game loop into separate threads. Having separate threads for each ensures that input is never delayed by graphics output; however, the code becomes more complex as a result. In scenario 4, the input thread calls ProcessEvents with CoreProcessEventsOption::ProcessUntilQuit, which waits for new events and dispatches all available events. It continues this behavior until the window is closed or the app calls the Close method on the CoreWindow instance.

void App::Run()
{
    // 4. Start a thread dedicated to rendering and dedicate the UI thread to input processing.
    m_main->StartRenderThread();

    // ProcessUntilQuit will block the thread and process events as they appear until the App terminates.
    CoreWindow::GetForCurrentThread()->Dispatcher->ProcessEvents(CoreProcessEventsOption::ProcessUntilQuit);
}

void JigsawPuzzleMain::StartRenderThread()
{
    // If the render thread is already running then do not start another one.
    if (IsRendering())
    {
        return;
    }

    // Create a task that will be run on a background thread.
    auto workItemHandler = ref new WorkItemHandler([this](IAsyncAction^ action)
    {
        // Notify the swap chain that this app intends to render each frame faster
        // than the display's vertical refresh rate (typically 60Hz). Apps that cannot
        // deliver frames this quickly should set this to 2.
        m_deviceResources->SetMaximumFrameLatency(1);

        // Calculate the updated frame and render once per vertical blanking interval.
        while (action->Status == AsyncStatus::Started)
        {
            // Execute any work items that have been queued by the input thread.
            ProcessPendingWork();

            // Take a snapshot of the current game state. This allows the renderers to work with a
            // set of values that won't be changed while the input thread continues to process events.
            m_state->SnapState();

            m_sceneRenderer->Render();
            m_deviceResources->Present();
        }

        // Ensure that all pending work items have been processed before terminating the thread.
        ProcessPendingWork();
    });

    // Run the task on a dedicated high priority background thread.
    m_renderLoopWorker = ThreadPool::RunAsync(workItemHandler, WorkItemPriority::High, WorkItemOptions::TimeSliced);
}

It’s worth noting that the DirectX-XAML template in Visual Studio 2013 splits the game loop into multiple threads in a similar fashion. It uses the Windows::UI::Core::CoreIndependentInputSource object to start a thread dedicated to handling input and also creates a rendering thread independent of the XAML UI thread. For more details on these templates, read the Getting started with DirectX topics on MSDN.

Additional ways to reduce input latency

Use waitable swap chains

DirectX apps respond to user input by updating what the user sees on-screen. On a 60Hz display, the screen refreshes every 16.7ms (1 second/60 frames). Figure 2 shows the approximate lifecycle and response to an input event relative to the 16.7ms refresh signal (VBlank) for an app that renders 60 frames per second:

Figure 2: Event processing budget across frames

In Windows 8.1, DXGI introduced the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag for the swap chain, which allows apps to easily reduce this latency without requiring them to implement heuristics to keep the Present queue empty. Swap chains created with this flag are referred to as waitable swap chains. Figure 3 shows the approximate lifecycle and response to an input event when using waitable swap chains:

Figure 3: Event processing budget across frames using waitable swap chains

What we see from these diagrams is that apps can potentially reduce input latency by two full frames in Windows 8.1 if they are capable of rendering and presenting each frame within the 16.7ms budget defined by the display’s refresh rate. The jigsaw puzzle sample uses waitable swap chains and controls the Present queue limit by calling: m_deviceResources->SetMaximumFrameLatency(1);

Disable touch visualizations

To reduce latency between touch input and the presented frame, you may also want to disable the touch visualization effect that Windows enables by default (e.g. glow underneath finger on touch). While we don’t recommend disabling touch visualizations as common practice for all cases, it can be appropriate when building highly responsive applications that don’t benefit as much from visual touch feedback. For example, users won’t benefit from touch feedback when using the virtual controller in the XAML DirectX 3D shooting game sample. Disabling touch visualizations can save roughly one frame or 16.7ms of input latency. To disable, add the following code within the IFrameworkView::SetWindow method:

// Disable all pointer visual feedback for better performance when using touch.
auto pointerVisualizationSettings = PointerVisualizationSettings::GetForCurrentView();
pointerVisualizationSettings->IsContactFeedbackEnabled = false; 
pointerVisualizationSettings->IsBarrelButtonFeedbackEnabled = false;

Measurements

We tested the jigsaw puzzle app’s input latency on a Surface 2 RT tablet running Windows 8.1 by processing frame images taken with a 300fps camera. We found that the touch tap input to display output latency averaged 56.7ms when using waitable swap chains.

To make it easier for the camera to detect when the frame that handles the input event is rendered to the screen, the jigsaw puzzle app is instrumented to toggle the app’s background color when the pointer event is processed. This instrumentation is contained within a conditional compiler directive (#ifdef MEASURE_LATENCY).

Wrapping up

By understanding the input responsiveness and power demands of your app, you can use this guide to structure your app in a way that best meets those demands. As seen in the implementation of the jigsaw puzzle app, an input-driven app only needs to update the screen in response to new input. This prevents identical frames from being rendered repeatedly, which improves power efficiency. Apps that are always animating might need to process input on another thread to get the best input latency.

Furthermore, you can perform optimizations like using waitable swap chains and disabling touch visualizations to get additional input latency savings.

We hope that you find this information helpful and applicable to your DirectX apps. The source code for the jigsaw puzzle app can be found in the MSDN samples gallery here: http://code.msdn.microsoft.com/windowsapps/CoreDispatcher-event-47f41a34.

–Jeff Piira, Test Manager, Windows Input and Composition Engine Team
–Bob Brown, Senior Software Development Engineer in Test, Windows Graphics
