TDR Debugging

Timeout detection and recovery (TDR) events and device removals occur as a result of a fault in the driver, the Direct3D runtime, or the hardware. A TDR or device removal resets the GPU adapter, at which point all device resources are lost. The underlying fault is typically triggered by invalid commands issued because of a bug in the application.

TDRs can be captured, analyzed, and debugged using PIX GPU captures. Note that to take a GPU capture for debugging a TDR, you must be able to reproduce the TDR while a GPU capture is in progress. If the TDR is caused by a race condition or another rare, hard-to-reproduce problem, it may not be possible to capture it using PIX.

Due to incompatibilities with current drivers, before using PIX to debug a TDR we recommend opening Settings and:

  • Unchecking Enable GPU Plugins
  • Checking Disable PIX HUD in applications running under GPU capture

 

Taking a GPU Capture of a TDR

To debug or analyze a TDR, you must take a GPU capture of the frame or workload that produces it. We recommend using programmatic capture via the IDXGraphicsAnalysis interface. Programmatic capture lets the application itself control when GPU capture begins and ends, which is typically the easiest way to capture the specific workload known to cause the TDR. For more information, see the PIX documentation on programmatic capture.

PIX GPU captures are robust against application-caused TDRs and crashes. If an application terminates unexpectedly during capture, PIX will produce a valid GPU capture containing every API call made prior to termination. A capture of an application that suffered a TDR during GPU capture can be identified by the “Device Removal (TDR) at capture” message in the Warnings pane when opening the file.

 

Debugging a TDR

 

Using PIX Remotely

It is strongly recommended to use PIX remoting to debug TDRs. PIX remoting allows the analysis engine to run on a different computer than the one running the PIX user interface. Because debugging a TDR typically causes the analysis engine to trigger a TDR on the machine it is running on, debugging TDRs locally may make the PIX user interface unstable. More information can be found in the PIX remoting documentation.

If you attempt to start analysis on a capture containing a TDR without using PIX remoting, PIX will emit a warning.

 

Setting the ‘TdrLimit’ Registry Keys

By default, Windows will block applications that cause excessive TDRs from accessing the graphics hardware. This is intended as protection against denial-of-service attacks but can result in PIX itself being blocked from accessing graphics hardware. If this occurs, PIX analysis will fail.
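The effect of such a limit can be illustrated with a small model. This is a sketch under stated assumptions, not a description of how Windows implements the check: it assumes the limit trips when more than `limit_count` TDRs fall within any window of `limit_seconds` seconds, and it assumes the Windows defaults are 5 TDRs in 60 seconds (as described in Microsoft's TDR registry key documentation). The function name and the sample timestamps are hypothetical.

```python
def exceeds_tdr_limit(tdr_times, limit_count, limit_seconds):
    """Return True if more than limit_count TDRs fall within any
    limit_seconds-long window (assumed rule, see the text above)."""
    times = sorted(tdr_times)
    start = 0
    for end, t in enumerate(times):
        # Advance the window start until the window spans at most limit_seconds.
        while t - times[start] > limit_seconds:
            start += 1
        if end - start + 1 > limit_count:
            return True
    return False


# Hypothetical analysis session: PIX replays a faulting workload and
# triggers one TDR every 2 seconds, 10 times in total.
tdr_times = [2.0 * i for i in range(10)]

# Assumed Windows defaults: 5 TDRs per 60 seconds.
blocked_by_default = exceeds_tdr_limit(tdr_times, limit_count=5, limit_seconds=60)

# Values used later in this section: 32 TDRs per 5 seconds.
blocked_with_debug_keys = exceeds_tdr_limit(tdr_times, limit_count=32, limit_seconds=5)

print(blocked_by_default, blocked_with_debug_keys)
```

Under these assumptions, the repeated TDRs of an analysis session exceed the default limit quickly, while the debugging values described in this section leave enough headroom for analysis to proceed.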

To debug TDRs using PIX, the ‘TdrLimit’ debugging registry keys must be set in order to disable the Windows TDR denial-of-service protections. The two registry keys that must be set are:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
    "TdrLimitCount"=dword:00000020
    "TdrLimitTime"=dword:00000005

These registry keys require a reboot to take effect. Because they are intended for debugging only, they should not be set outside of testing and debugging scenarios. To unset them, delete the TdrLimitCount and TdrLimitTime values using regedit. More information about these registry keys can be found in the TDR Registry Keys documentation.
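Note that the `dword:` values in the fragment above are hexadecimal, so `00000020` is 32, not 20. The following sketch decodes the fragment and, on Windows only, reads the values currently stored in the registry so they can be compared. The helper names `parse_dwords` and `read_current_values` are illustrative and not part of PIX; the registry access uses Python's standard `winreg` module.

```python
import re

# The .reg fragment from this page, verbatim.
REG_FRAGMENT = r'''
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrLimitCount"=dword:00000020
"TdrLimitTime"=dword:00000005
'''

SUBKEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def parse_dwords(reg_text):
    """Return {name: int} for every "name"=dword:XXXXXXXX line."""
    pattern = re.compile(r'^"([^"]+)"=dword:([0-9a-fA-F]{8})$', re.MULTILINE)
    return {name: int(digits, 16) for name, digits in pattern.findall(reg_text)}


def read_current_values(names):
    """Windows only: read the named values from the registry (None if unset)."""
    import winreg  # standard library, available only on Windows

    current = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SUBKEY) as key:
        for name in names:
            try:
                current[name], _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                current[name] = None
    return current


expected = parse_dwords(REG_FRAGMENT)
print(expected)  # {'TdrLimitCount': 32, 'TdrLimitTime': 5}
```

On the analysis machine, `read_current_values(expected) == expected` indicates that the values are present in the registry. It does not show whether the machine has been rebooted since they were written.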

PIX will detect an attempt to perform analysis on a machine that does not have these registry keys set, and display a dialog offering to set them.

Selecting ‘Yes’ will cause PIX to set the required registry keys on the analysis device automatically, and prompt for a reboot to allow them to take effect. Setting these registry keys requires administrative privileges, so a UAC prompt may appear to request elevation for PIX.

 

TDR Analysis

Once PIX is running remotely and the correct TdrLimit registry keys are set, PIX can be used to debug and inspect the capture file as usual. A new feature is available in Dr. PIX to aid in finding the GPU operation that triggered the TDR.

The ‘TDR Analysis’ experiment executes each GPU operation (such as draws and dispatches) in isolation and reports back the event that was determined to have triggered the TDR or device removal. Clicking the link in the ‘Event’ column will navigate to the relevant entry in the Event List.
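The idea of running each operation in isolation can be sketched as a simple search. This is a conceptual illustration only, not PIX's actual implementation: `GpuEvent`, `find_tdr_event`, and `fake_replay` are hypothetical names, and the replay callback stands in for whatever mechanism replays a single event on the GPU and reports whether the device was removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuEvent:
    event_id: int  # hypothetical ID, standing in for an Event List entry
    name: str


def find_tdr_event(events, replay_in_isolation):
    """Replay each event on its own and return the first one whose
    replay reports a device removal, or None if none does."""
    for event in events:
        if replay_in_isolation(event):
            return event
    return None


# Hypothetical event list for one captured frame.
events = [
    GpuEvent(1, "ClearRenderTargetView"),
    GpuEvent(2, "DrawIndexedInstanced"),
    GpuEvent(3, "Dispatch"),
    GpuEvent(4, "DrawInstanced"),
]


def fake_replay(event):
    # Stand-in for a real replay: pretend the Dispatch hangs the GPU.
    return event.name == "Dispatch"


culprit = find_tdr_event(events, fake_replay)
print(f"{culprit.event_id}: {culprit.name}")  # 3: Dispatch
```

Isolating each operation means a fault is attributed to the operation that was executing when the device was removed, rather than to whichever later operation first observed the failure.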

Once the GPU operation is identified, the other views in PIX can be used to debug it. In particular, Shader Edit and Continue can be used to quickly iterate on shaders to narrow down the cause of the TDR.

Note that not all views will be available when a TDR is present. For example, if executing a pixel shader on a draw call causes a TDR, PIX will be unable to display the rendering output in the Pipeline view for that draw call. However, most views in PIX will generally continue to function when debugging a TDR. Collection of timing data and performance counters is not supported for captures that contain TDRs.

 

Using the Debug Layer and GPU Validation

The D3D12 Debug Layer is a powerful tool for determining the root cause of a TDR. When analyzing a capture, PIX can automatically enable the debug layer and, optionally, run additional GPU Validation. Running the debug layer or GPU Validation from within PIX is useful if, for example, the performance overhead of the debug layer is too great to enable it in the application itself. Many forms of invalid behavior, including those that can cause TDRs, are caught by the debug layer and GPU Validation.