New in D3D12 – DRED helps developers diagnose GPU faults

Bill

DRED stands for Device Removed Extended Data. DRED is an evolving set of diagnostic features designed to help identify the cause of unexpected device removal errors, delivering automatic breadcrumbs and GPU page fault reporting on hardware that supports the necessary features (more about that later).

DRED version 1.1 is available today in the latest 19H1 builds accessible through the Windows Insider Program (I will refer to this as ‘19H1’ for the rest of this post). Try it out and please send us your feedback!

Auto-Breadcrumbs

In Windows 10 version 1803 (April 2018 Update / Redstone 4) Microsoft introduced the ID3D12GraphicsCommandList2::WriteBufferImmediate API and encouraged developers to use this to place “breadcrumbs” in the GPU command stream to track GPU progress before a TDR (Timeout Detection and Recovery event). This is still a good approach if a developer wishes to create a custom, low-overhead implementation, but it may lack some of the versatility of a standardized solution, such as debugger extensions or Watson reporting.

DRED Auto-Breadcrumbs also uses WriteBufferImmediate to place progress counters in the GPU command stream. DRED inserts a breadcrumb after each “render op”, meaning after every operation that results in GPU work (e.g. Draw, Dispatch, Copy, Resolve). If the device is removed in the middle of a GPU workload, the DRED breadcrumb value is essentially a count of render ops completed before the error.

Up to 65536 operations in a given command list are retained in the breadcrumb history ring buffer. If a command list contains more than 65536 operations, only the last 65536 are stored, with the oldest overwritten first. However, the breadcrumb counter value continues to count up to UINT_MAX. Therefore, LastOpIndex = (BreadcrumbCount - 1) % 65536.
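
The index arithmetic above can be sketched as a pair of small helpers. This is only an illustration of the formula in the text; the names kBreadcrumbHistorySize, HasRecordedOps and LastOpIndex are hypothetical and not part of the D3D12 headers.

```cpp
#include <cstdint>

// Capacity of the breadcrumb history ring buffer, as stated in the text.
constexpr std::uint32_t kBreadcrumbHistorySize = 65536;

// True when at least one operation has been counted.
constexpr bool HasRecordedOps(std::uint32_t breadcrumbCount)
{
    return breadcrumbCount != 0;
}

// Maps a breadcrumb counter value to the ring-buffer slot of the most
// recently counted operation. Only meaningful when
// HasRecordedOps(breadcrumbCount) is true.
constexpr std::uint32_t LastOpIndex(std::uint32_t breadcrumbCount)
{
    return (breadcrumbCount - 1u) % kBreadcrumbHistorySize;
}
```

For counts up to 65536 the index is simply the count minus one. Beyond that the index wraps, which is why the counter alone does not say where the newest entry lives.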

DRED v1.0 was “released” in Windows 10 version 1809 (October 2018 Update / Redstone 5), exposing rudimentary Auto-Breadcrumbs. However, there were no APIs, and the only way to enable DRED was to use FeedbackHub to capture a TDR repro for Game Performance and Compatibility. The primary purpose of DRED in 1809 was to help find the root cause of game crashes via customer feedback.

Caveats

  • Because GPUs are heavily pipelined, there is no guarantee that the breadcrumb counter will indicate the exact operation that failed. In fact, on some tile-based deferred rendering devices, it is possible for the breadcrumb counter to be a full resource or UAV barrier behind the actual GPU progress.
  • Drivers can reorder commands, pre-fetch from resource memory well before executing a command, or flush cached memory well after completion of a command. Any of these can produce GPU errors. In such cases the Auto-Breadcrumb counters may be less helpful or misleading.

Performance

Although Auto-Breadcrumbs are designed to be low-overhead, they are far from free. Empirical measurements show a performance loss of between 2% and 5% on typical “AAA” D3D12 graphics game engines. For this reason, Auto-Breadcrumbs are off by default.

Hardware Requirements

Because the breadcrumb counter values must be preserved after device removal, the resource containing breadcrumbs must exist in system memory and must persist in the event of device removal. This means the driver must support D3D12_FEATURE_EXISTING_HEAPS. Fortunately, this is true for most 19H1 D3D12 drivers.

GPU Page Fault Reporting

A new DRED v1.1 feature in 19H1 is DRED GPU Page Fault Reporting. GPU page faults commonly occur when:

  1. An application mistakenly executes work on the GPU that references a deleted object.
    • This appears to be one of the top causes of unexpected device removals.
  2. An application mistakenly executes work on the GPU that accesses an evicted resource or non-resident tile.
  3. A shader references an uninitialized or stale descriptor.
  4. A shader indexes beyond the end of a root binding.

DRED attempts to address some of these scenarios by reporting the names and types of any existing or recently freed API objects that match the virtual address (VA) of the GPU-reported page fault.

Performance

The D3D12 runtime must actively curate a collection of existing and recently deleted API objects indexable by VA. This increases the system memory overhead and introduces a small performance hit to object creation and destruction. For now, page fault reporting is still off by default.

Hardware Requirements

Many, but not all, GPUs currently support GPU page faults. Hardware that doesn’t support page faulting can still benefit from Auto-Breadcrumbs.

Caveat

Not all GPUs support page faults. Some GPUs respond to memory faults by discarding writes to a bit bucket, reading simulated data (e.g. zeros), or simply hanging. Unfortunately, in cases where the GPU doesn’t immediately hang, TDRs can happen later in the pipe, making it even harder to locate the root cause.

Setting up DRED in Code

DRED settings must be configured prior to creating a D3D12 device. Use D3D12GetDebugInterface to get an interface to the ID3D12DeviceRemovedExtendedDataSettings object.

Example:

Accessing DRED Data in Code

After device removal has been detected (e.g. Present returns DXGI_ERROR_DEVICE_REMOVED), use ID3D12DeviceRemovedExtendedData methods to access the DRED data for the removed device.

The ID3D12DeviceRemovedExtendedData interface can be obtained by calling QueryInterface on an ID3D12Device object.

Example:

Debugger Access to DRED

Debuggers have access to the DRED data via the d3d12!D3D12DeviceRemovedExtendedData data export. We are working on a WinDbg extension that helps simplify visualization of the DRED data. Stay tuned for more.

DRED Telemetry

Applications can use the DRED APIs to control DRED features and collect telemetry for post-mortem analysis. This gives app developers a much broader net for catching those hard-to-repro TDRs that are a familiar source of frustration.

As of 19H1, all user-mode device-removed events are reported to Watson. If a particular app + GPU + driver combination generates enough device-removed events, Microsoft may temporarily enable DRED for customers launching the same app on a similar configuration.

DRED v1.1 APIs

D3D12_DRED_VERSION

Version used by D3D12_VERSIONED_DEVICE_REMOVED_EXTENDED_DATA.

Constants
D3D12_DRED_VERSION_1_0 – DRED version 1.0
D3D12_DRED_VERSION_1_1 – DRED version 1.1

D3D12_AUTO_BREADCRUMB_OP

Enum values corresponding to render/compute GPU operations

D3D12_DRED_ALLOCATION_TYPE

Congruent with and numerically equivalent to D3D12DDI_HANDLETYPE enum values.

D3D12_DRED_ENABLEMENT

Used by ID3D12DeviceRemovedExtendedDataSettings to specify how individual DRED features are enabled. As of DRED v1.1, the default value for all settings is D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED.

Constants
D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED – The DRED feature is enabled only when DRED is turned on by the system automatically (e.g. when a user is reproducing a problem via FeedbackHub)
D3D12_DRED_ENABLEMENT_FORCED_ON – Forces a DRED feature on, regardless of system state.
D3D12_DRED_ENABLEMENT_FORCED_OFF – Disables a DRED feature, regardless of system state.
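
The three values can be read as a small decision rule. The sketch below models that rule with a stand-in enum so that it compiles without the Windows SDK. DredEnablement and IsDredFeatureActive are hypothetical names, and the code illustrates the semantics described above rather than the runtime’s actual logic.

```cpp
// Stand-in for D3D12_DRED_ENABLEMENT (hypothetical, for illustration only).
enum class DredEnablement
{
    SystemControlled,  // mirrors D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED
    ForcedOn,          // mirrors D3D12_DRED_ENABLEMENT_FORCED_ON
    ForcedOff          // mirrors D3D12_DRED_ENABLEMENT_FORCED_OFF
};

// Returns whether a DRED feature would be active, given the app's setting
// and whether the system has turned DRED on (e.g. during a FeedbackHub repro).
constexpr bool IsDredFeatureActive(DredEnablement setting, bool systemEnabled)
{
    return setting == DredEnablement::ForcedOn    ? true
           : setting == DredEnablement::ForcedOff ? false
                                                  : systemEnabled;
}
```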

D3D12_AUTO_BREADCRUMB_NODE

D3D12_AUTO_BREADCRUMB_NODE objects are singly linked to each other via the pNext member. The last node in the list will have a null pNext.

Members
pCommandListDebugNameA – Pointer to the ANSI debug name of the command list (if any)
pCommandListDebugNameW – Pointer to the wide debug name of the command list (if any)
pCommandQueueDebugNameA – Pointer to the ANSI debug name of the command queue (if any)
pCommandQueueDebugNameW – Pointer to the wide debug name of the command queue (if any)
pCommandList – Address of the command list at the time of execution
pCommandQueue – Address of the command queue
BreadcrumbCount – Number of render operations used in the command list recording
pLastBreadcrumbValue – Pointer to the number of GPU-completed render operations
pCommandHistory – Pointer to the array of “render operations” used by the command list
pNext – Pointer to the next node in the list or nullptr if this is the last node
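
Walking the list and comparing each node’s completed count against its recorded count is a common way to narrow down where the GPU stopped. The sketch below uses a stand-in struct that mirrors a subset of the members listed above so that it compiles without the Windows SDK. BreadcrumbNode and FindInFlightNode are hypothetical names, and the “started but not finished” test is a heuristic assumed here, not a rule stated by DRED.

```cpp
#include <cstdint>

// Stand-in mirroring a subset of D3D12_AUTO_BREADCRUMB_NODE (illustration only).
struct BreadcrumbNode
{
    const char* pCommandListDebugNameA;         // may be null
    std::uint32_t BreadcrumbCount;              // ops recorded in the command list
    const std::uint32_t* pLastBreadcrumbValue;  // ops completed by the GPU
    const BreadcrumbNode* pNext;                // null on the last node
};

// Returns the first node whose command list was started but not finished
// by the GPU, or nullptr if no such node exists.
const BreadcrumbNode* FindInFlightNode(const BreadcrumbNode* head)
{
    for (const BreadcrumbNode* node = head; node != nullptr; node = node->pNext)
    {
        const std::uint32_t completed =
            node->pLastBreadcrumbValue ? *node->pLastBreadcrumbValue : 0;

        // Heuristic: skip lists that never started (0) or ran to completion.
        if (completed != 0 && completed != node->BreadcrumbCount)
        {
            return node;
        }
    }
    return nullptr;
}
```

Given the caveats about pipelining and reordering, the node found this way is a starting point for investigation rather than a guaranteed culprit.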

D3D12_DRED_ALLOCATION_NODE

Describes allocation data for a DRED-tracked allocation. If device removal is caused by a GPU page fault, DRED reports all matching allocation nodes for active and recently-freed runtime objects.

D3D12_DRED_ALLOCATION_NODE objects are singly linked to each other via the pNext member. The last node in the list will have a null pNext.

D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT

Contains a pointer to the head of a linked list of D3D12_AUTO_BREADCRUMB_NODE structures.

Members
pHeadAutoBreadcrumbNode – Pointer to the head of a linked list of D3D12_AUTO_BREADCRUMB_NODE objects

D3D12_DRED_PAGE_FAULT_OUTPUT

Provides the VA of a GPU page fault and contains a list of matching allocation nodes for active objects and a list of allocation nodes for recently deleted objects.

Members
PageFaultVA – GPU Virtual Address of GPU page fault
pHeadExistingAllocationNode – Pointer to head allocation node for existing runtime objects with VA ranges that match the faulting VA
pHeadRecentFreedAllocationNode – Pointer to head allocation node for recently freed runtime objects with VA ranges that match the faulting VA
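
Both allocation lists are traversed the same way. The sketch below uses stand-in structs so that it compiles without the Windows SDK. AllocationNode, PageFaultOutput and DescribePageFault are hypothetical names. The ObjectNameA field mirrors the ANSI name member of D3D12_DRED_ALLOCATION_NODE in d3d12.h, which this post does not list.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-in mirroring part of D3D12_DRED_ALLOCATION_NODE (illustration only).
struct AllocationNode
{
    const char* ObjectNameA;      // debug name, may be null
    const AllocationNode* pNext;  // null on the last node
};

// Stand-in mirroring D3D12_DRED_PAGE_FAULT_OUTPUT (illustration only).
struct PageFaultOutput
{
    std::uint64_t PageFaultVA;
    const AllocationNode* pHeadExistingAllocationNode;
    const AllocationNode* pHeadRecentFreedAllocationNode;
};

// Builds one line of text per allocation matching the faulting VA,
// existing objects first, then recently freed ones.
std::vector<std::string> DescribePageFault(const PageFaultOutput& output)
{
    std::vector<std::string> lines;
    auto append = [&lines](const char* label, const AllocationNode* node) {
        for (; node != nullptr; node = node->pNext)
        {
            const char* name = node->ObjectNameA ? node->ObjectNameA : "<unnamed>";
            lines.push_back(std::string(label) + name);
        }
    };
    append("existing: ", output.pHeadExistingAllocationNode);
    append("recently freed: ", output.pHeadRecentFreedAllocationNode);
    return lines;
}
```

An entry in the recently freed list often points at the first scenario described earlier: GPU work that references a deleted object.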

D3D12_DEVICE_REMOVED_EXTENDED_DATA1

DRED v1.1 data structure.

Members
DeviceRemovedReason – The device removed reason matching the return value of GetDeviceRemovedReason
AutoBreadcrumbsOutput – Contained D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT member
PageFaultOutput – Contained D3D12_DRED_PAGE_FAULT_OUTPUT member

D3D12_VERSIONED_DEVICE_REMOVED_EXTENDED_DATA

Encapsulates the versioned DRED data. The appropriate unioned Dred_* member must match the value of Version.

Members
Version – The D3D12_DRED_VERSION value that selects which Dred_* member is valid
Dred_1_0 – DRED data as of Windows 10 version 1809
Dred_1_1 – DRED data as of Windows 10 19H1

ID3D12DeviceRemovedExtendedDataSettings

Interface controlling DRED settings. All DRED settings must be configured prior to D3D12 device creation. Use D3D12GetDebugInterface to get the ID3D12DeviceRemovedExtendedDataSettings interface object.

Methods
SetAutoBreadcrumbsEnablement – Configures the enablement settings for DRED auto-breadcrumbs.
SetPageFaultEnablement – Configures the enablement settings for DRED page fault reporting.
SetWatsonDumpEnablement – Configures the enablement settings for DRED Watson dumps.

ID3D12DeviceRemovedExtendedDataSettings::SetAutoBreadcrumbsEnablement

Configures the enablement settings for DRED auto-breadcrumbs.

Parameters
Enablement – Enablement value (defaults to D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED)

ID3D12DeviceRemovedExtendedDataSettings::SetPageFaultEnablement

Configures the enablement settings for DRED page fault reporting.

Parameters
Enablement – Enablement value (defaults to D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED)

ID3D12DeviceRemovedExtendedDataSettings::SetWatsonDumpEnablement

Configures the enablement settings for DRED Watson dumps.

Parameters
Enablement – Enablement value (defaults to D3D12_DRED_ENABLEMENT_SYSTEM_CONTROLLED)

ID3D12DeviceRemovedExtendedData

Provides access to DRED data. Methods return DXGI_ERROR_NOT_CURRENTLY_AVAILABLE if the device is not in a removed state.

Use ID3D12Device::QueryInterface to get the ID3D12DeviceRemovedExtendedData interface.

Methods
GetAutoBreadcrumbsOutput – Gets the DRED auto-breadcrumbs output.
GetPageFaultAllocationOutput – Gets the DRED page fault data.

ID3D12DeviceRemovedExtendedData::GetAutoBreadcrumbsOutput

Gets the DRED auto-breadcrumbs output.

Parameters
pOutput – Pointer to a destination D3D12_DRED_AUTO_BREADCRUMBS_OUTPUT structure.

ID3D12DeviceRemovedExtendedData::GetPageFaultAllocationOutput

Gets the DRED page fault data, including matching allocations for both living and recently deleted runtime objects.

Parameters
pOutput – Pointer to a destination D3D12_DRED_PAGE_FAULT_OUTPUT structure.

Bill Kristiansen