June 30th, 2020

Hardware Accelerated GPU Scheduling

Steve Pronovost
Partner Development Lead

Abstract

You may have noticed a mysterious new optional feature called Hardware Accelerated GPU Scheduling appear in the advanced graphics settings page with the Windows 10 May 2020 update. The purpose of this blog is to give some background on this new feature and how we are introducing it. It is intended for folks curious about Windows internals. Remaining on the cutting edge of hardware innovation has always been a critical aspect of our graphics platform. Hardware Accelerated GPU Scheduling enables more efficient GPU scheduling between applications. For most users, this transition will be transparent. It is one of those things that if we do our job right, you will never know the transition happened. As the graphics platform continues to evolve, this modernization will enable new scenarios in the future.

WDDM GPU Scheduler

It has been almost 14 years since the introduction of the Windows Display Driver Model 1.0 (WDDM) and, with it, GPU scheduling in Windows. Few likely remember the pre-WDDM days, when applications could simply submit as much work to the GPU as they wanted. Work went into a global queue, where it was executed in a strict “first to submit, first to execute” fashion. This rudimentary scheduling scheme was workable at a time when most GPU applications were full-screen games run one at a time.

With the transition to a broad set of applications using the GPU for richer graphics and animations, the platform needed to better prioritize GPU work to ensure a responsive user experience. Thus, the WDDM GPU scheduler was born.
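To make the contrast concrete, here is a minimal sketch of a strict first-come, first-served queue next to a priority-ordered one. It is a toy model, not the WDDM scheduling algorithm: the `Job` type, the application names, the durations, and the priority values are all hypothetical, and every job is assumed to be queued at time zero.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    app: str            # hypothetical application name
    duration_ms: float  # GPU time the job needs
    priority: int       # assumed convention: lower value runs first


def fifo_completion_times(jobs):
    """Run jobs strictly in submission order; return {app: completion time in ms}."""
    clock = 0.0
    done = {}
    for job in jobs:
        clock += job.duration_ms
        done[job.app] = clock
    return done


def priority_completion_times(jobs):
    """Run jobs by priority, breaking ties by submission order."""
    # The submission index keeps ordering stable and means Job objects
    # themselves are never compared.
    heap = [(job.priority, index, job) for index, job in enumerate(jobs)]
    heapq.heapify(heap)
    clock = 0.0
    done = {}
    while heap:
        _, _, job = heapq.heappop(heap)
        clock += job.duration_ms
        done[job.app] = clock
    return done


# Hypothetical workload: a short desktop UI update queued behind two long jobs.
jobs = [
    Job("game", 30.0, 1),
    Job("video", 10.0, 1),
    Job("desktop_ui", 2.0, 0),
]

print(fifo_completion_times(jobs))
print(priority_completion_times(jobs))
```

In this made-up workload the short UI job finishes last under the global queue and first under priority ordering, which is the kind of responsiveness problem the paragraph above describes.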

Over time we have significantly enhanced the GPU scheduler at the heart of WDDM, supporting additional features and scenarios with each new WDDM version. However, throughout its evolution, one aspect of the scheduler was unchanged. We have always had a high-priority thread running on the CPU that coordinates, prioritizes, and schedules the work submitted by various applications.

This approach to scheduling the GPU has some fundamental limitations in terms of submission overhead, as well as latency for the work to reach the GPU. These overheads have been mostly masked by the way applications have traditionally been written. For example, an application would typically do GPU work on frame N, and have the CPU run ahead and work on preparing GPU commands for frame N+1. This buffering of GPU commands into batches allows an application to submit just a few times per frame, minimizing the cost of scheduling and ensuring good CPU-GPU execution parallelism.

An inherent side effect of buffering between CPU and GPU is that the user experiences increased latency. User input is picked up by the CPU during “frame N+1” but is not rendered by the GPU until the following frame. There is a fundamental tension between latency reduction and submission/scheduling overhead. Applications may submit more frequently, in smaller batches, to reduce latency, or they may submit larger batches of work to reduce submission and scheduling overhead.
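The trade-off can be illustrated with a deliberately simple cost model. Everything here is an assumption made for illustration: the function name, the idea that each submission carries a fixed cost, and the numbers themselves. It does not measure any real driver.

```python
def submission_tradeoff(record_ms_per_frame, overhead_ms_per_submit, submits_per_frame):
    """Return (total submission overhead, delay before the first batch reaches the GPU).

    Assumed model: the CPU records one frame's commands at a steady rate,
    splits them into equal batches, and pays a fixed cost per submission.
    """
    batch_record_ms = record_ms_per_frame / submits_per_frame
    total_overhead_ms = overhead_ms_per_submit * submits_per_frame
    first_work_latency_ms = batch_record_ms + overhead_ms_per_submit
    return total_overhead_ms, first_work_latency_ms


# Hypothetical numbers: 8 ms of command recording per frame,
# 0.125 ms of fixed cost per submission.
for submits in (1, 4, 16):
    overhead, latency = submission_tradeoff(8.0, 0.125, submits)
    print(f"{submits:2d} submits/frame: overhead {overhead} ms, first work after {latency} ms")
```

Under these assumptions, more submissions per frame get work to the GPU sooner but pay the fixed cost more often. Lowering that per-submission cost is what makes small, frequent batches more affordable.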

Hardware-accelerated GPU scheduling

With the Windows 10 May 2020 update, we are introducing a new GPU scheduler as an opt-in option that is off by default. With the right hardware and drivers, Windows can now offload most of GPU scheduling to a dedicated GPU-based scheduling processor.

Windows continues to control prioritization and decide which applications have priority among contexts. We offload high frequency tasks to the GPU scheduling processor, handling quanta management and context switching of various GPU engines.
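The division of labor can be sketched as a toy simulation: the operating system assigns a priority to each context, and a separate loop standing in for the GPU scheduling processor hands out time slices and switches between contexts. This is an illustration under stated assumptions, not the actual hardware algorithm. The function name, the context names, the "higher number runs first" convention, and the round-robin policy are all hypothetical, and all work is assumed to be queued up front with no new arrivals.

```python
from collections import deque


def run_quanta(contexts, quantum_ms):
    """Simulate quantum-based execution of GPU contexts.

    contexts: list of (name, priority, remaining_ms); priorities are the
    part decided by the OS in this sketch (higher value runs first).
    Returns (timeline, context_switches), where timeline is a list of
    (name, slice_ms) in execution order.
    """
    # Group contexts by priority level, preserving submission order.
    levels = {}
    for name, priority, remaining_ms in contexts:
        levels.setdefault(priority, deque()).append((name, remaining_ms))

    timeline = []
    # The high-frequency part: hand out quanta round-robin within a level.
    for priority in sorted(levels, reverse=True):
        queue = levels[priority]
        while queue:
            name, remaining_ms = queue.popleft()
            slice_ms = min(quantum_ms, remaining_ms)
            timeline.append((name, slice_ms))
            if remaining_ms > slice_ms:
                queue.append((name, remaining_ms - slice_ms))

    # Count a switch whenever consecutive slices belong to different contexts.
    switches = sum(1 for a, b in zip(timeline, timeline[1:]) if a[0] != b[0])
    return timeline, switches


# Hypothetical contexts: (name, OS-assigned priority, GPU work in ms).
contexts = [("compositor", 2, 3), ("game", 1, 5), ("encoder", 1, 4)]
timeline, switches = run_quanta(contexts, quantum_ms=2)
print(timeline)
print("context switches:", switches)
```

The point of the sketch is the shape of the work. Priorities change rarely, while quantum bookkeeping and context switches happen constantly, which is why the latter are the natural candidates for offloading.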

The new GPU scheduler is a significant and fundamental change to the driver model. Changing the scheduler is akin to rebuilding the foundation of a house while still living in it. To ensure a smooth transition we are introducing the new scheduler as an early-adopter, opt-in feature. During the transition we will gather large scale performance and reliability data as well as customer feedback.

We are adding UI to the Advanced Graphics Settings page to control enabling the new GPU scheduler. The settings page can be reached through Settings -> System -> Display -> Graphics Settings. If both your GPU and driver support the new GPU scheduler, the option to turn it on will appear on that page.

Which GPUs will support Hardware Scheduling?

The new GPU scheduler will be supported on recent GPUs that have the necessary hardware, combined with a WDDMv2.7 driver that exposes this support to Windows. Please watch for announcements from our hardware vendor partners on specific GPU generations and driver versions this support will be enabled for.

Hardware accelerated GPU scheduling is a big change for drivers. While some GPUs have the necessary hardware, the associated driver exposing this support will only be released once it has gone through a significant amount of testing with our Insider population.

If you are an Insider and have chosen to install a build of Windows from our Fast or Slow distribution ring, you have been running a version of Windows with support for hardware accelerated GPU scheduling. You may have even been part of our experimentation!

As we get under-development drivers from our GPU manufacturer partners, we publish these drivers to an Insider version of Windows Update (WU) where distribution is limited to the Insider population. In the Insider Fast Ring, we can run experiments where we silently toggle hardware accelerated GPU scheduling on for some users, so that we get a mix of users running with and without the new scheduler.

Through our experimentation platform and our telemetry system we can effectively run A/B experiments and see how systems running with hardware accelerated GPU scheduling compare to systems running our old GPU scheduler. We monitor reliability telemetry such as kernel crashes (bluescreens), user mode crashes, GPU hangs, and freezes/deadlocks, as well as a limited set of performance metrics.
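As an illustration of what comparing the two groups can look like, here is a minimal sketch using a two-proportion z-test on crash counts. The choice of test is an assumption; this post does not say which statistical methods the experimentation platform uses. The function name and all counts below are made up.

```python
import math


def two_proportion_z(failures_a, total_a, failures_b, total_b):
    """Two-sided z-test for a difference in failure rates between two groups.

    Returns (z, p_value). Assumes large, independent samples.
    """
    rate_a = failures_a / total_a
    rate_b = failures_b / total_b
    pooled = (failures_a + failures_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical telemetry: machines reporting at least one GPU hang.
z, p = two_proportion_z(failures_a=120, total_a=10_000,   # new scheduler
                        failures_b=80, total_b=10_000)    # old scheduler
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value would suggest the difference in hang rates is unlikely to be noise. A real pipeline would also need to account for many metrics being tested at once and for differences in hardware mix between the groups.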

Once a driver completes support for hardware accelerated scheduling and accumulates enough execution time in our Insider Pool to demonstrate its reliability and performance, it is allowed to be promoted to the public version of Windows Update where it becomes available to everyone running the supported hardware.

Why not have hardware accelerated GPU scheduling on by default for all users given all the care taken before a driver can expose this support? Although we do a lot of validation through our Insider population, the number of system configurations and scenarios in the Insider population does not fully cover what can happen in our ecosystem of more than a billion devices. Because hardware accelerated GPU scheduling is such a fundamental pillar of the graphics subsystem and is used in absolutely everything that you do on your PC, we decided to introduce it initially as an opt-in to avoid any possible disruption. Users can opt in through the UI, and for new systems, OEMs are encouraged to configure and validate their systems with hardware accelerated GPU scheduling turned on from the factory.

What to expect when switching to the new GPU scheduler

The transition should be transparent, and users should not notice any significant changes. Although the new scheduler reduces the overhead of GPU scheduling, most applications have been designed to hide scheduling costs through buffering.

The goal of the first phase of hardware accelerated GPU scheduling is to modernize a fundamental pillar of the graphics subsystem and to set the stage for things to come… but that’s going to be a story for another time 😊.

We do not expect customers to experience performance regressions but if you encounter any, please be sure to file feedback at: https://aka.ms/submitgameperformancefeedback.

Thanks

Please give this a try and let us know what you think!

Category
DirectX

Author

Steve Pronovost
Partner Development Lead

Lead and architect for the Windows Graphics Kernel.

18 comments


  • Chandan Nataraj

    Hi Steve,

    Thanks for the article.
    Is this going to affect TCC mode? Say I have two Windows applications that use CUDA and the GPU is set to use TCC mode. Is there a way to give one of the processes priority on the GPU over the other when there is a high priority task?

  • R. K.

    Just so you know. With the activated GPU Scheduling you may expect that your multi gpu setup will not work anymore in blender 2.8.3 / cuda / cycles.
    Took me a long time to figure it out. Because this feature will be activated automatically without informing you. Please stop doing this Microsoft.

    My output with blender --debug-cycles:

    CUDA error: Launch failed in cuGraphicsResourceGetMappedPointer(&buffer, &bytes, pmem.cuPBOresource), line 2000

    Refer to the Cycles GPU rendering documentation for possible solutions:
    https://docs.blender.org/manual/en/latest/render/cycles/gpu_rendering.html

    CUDA error: Launch failed in cuModuleGetFunction(&cuFilmConvert, cuModule, "kernel_cuda_convert_to_half_float"), line 1865

  • Gaganjot Singh

    That’s an interesting feature in Windows 10. Will it increase hardware consumption on the GPU?

  • Rhuan Henrique de Paula Vitor

    I have an RX 5700 with Driver 20.7.2 and I can’t activate Hardware Accelerated GPU Scheduling; the option to activate it doesn’t even appear. Anyone else with this problem?

  • Said Rahal Martínez

    After installing this update I began to get a strange error. Suddenly the mouse pointer froze, and so did the screen; then the screen went black as if it had been turned off, then came back to life as if nothing had happened. My GPU is brand new and this error began only AFTER the May update and activating the new GPU ACCELERATION feature. The strange thing is that this only happens while WORKING with Firefox, using Explorer to drop files, and so on. Never when playing, never when watching movies full screen. I have now disabled this feature and so far the strange freezing has gone! Please...

  • Martin 345

    Glad to see even fundamental things like that getting an overhaul.

    Though it is not quite stable, definitely. I’ve had regular cases since I moved to v2004 with HAGS where the graphics just freeze. System still seems to be running, but without being able to see what you’re doing, it’s not very useful, so it requires a manual reset. I’m curious – does your telemetry account for these types of scenarios?

  • Michal Richter

    This new feature might be interesting for CUDA developers! My company is currently porting our CUDA-based software from Linux to Windows. I was looking for ways to reduce the WDDM overhead of CUDA kernel launches and found this article. Switching to Hardware-accelerated GPU scheduling considerably improved the performance of our application (particularly when used with CUDA graphs).

    • Joe Stump

      Could you elaborate more on the improvement gained? Based on my account with TensorRT, there is a substantial performance regression regardless of HAGS being enabled or disabled. Apart from your testimonial, there is no other positive report on this (assuming your work revolves around CUDA for machine learning and not gaming), not even on https://forums.developer.nvidia.com/.

  • Fábio Radicchi Belotto

    Did anyone notice an increase of hardware consumption on the GPU?

    Now I am getting bottlenecked when using raytracing.

  • Big Blue Frontend

    Sorry, but the communication on this feature was absolutely terrible. Instead of being coy and saying that the only thing people care about is a “story for another time,” how about fessing up NOW and just communicating why people should care about this.

    “We’re not going to tell you why you should give a crap about this bizarre, weirdo feature shrouded in mystery that we won’t elaborate on, but please, tell us what you think about it!”

    Okay, champ.

    • Nah Nood

      LOL right, just like “Game Mode.” haha

  • Matthew Faust

    Curious, buffering ahead a few frames was mentioned, would hardware scheduling improve performance more when something like AMD Anti-Lag or Nvidia Ultra Low Latency Mode is utilized? Those have historically been partially CPU bound, curious if that’s because of the dependency on the CPU scheduler.