It’s easy to get data into the GPU, but harder to get it out

Raymond Chen

Back in the old days, computer graphics were handled by the CPU by directly manipulating the frame buffer,¹ and the graphics card’s job was simply to put the pixels in the frame buffer onto the screen. As computer graphics technology has progressed, more and more work has been offloaded onto the GPU. Nowadays, the GPU draws triangles without CPU assistance, it has a z-buffer, it runs pixel shaders, it does texture mapping. The CPU shovels raw data into the GPU, and the GPU does the work of combining those pixels to form a final result, which is sent to the screen. All the work offloaded to the GPU means that the CPU is freed up for other things.

In order to accomplish this feat, graphics cards are designed so that the CPU can quickly pump data into the GPU, and the GPU has convenient, fast access to its memory. What was not optimized is getting data back out, so in practice, reading data out of the GPU is relatively slow. This trade-off is typically a huge win, because the CPU rarely needs to look at the final result. Not caring about the final result also means that the CPU can use tricks like double-buffering and triple-buffering² to keep the graphics pipeline full.

What this means for you is that operations like GetPixel and BitBlt from the screen will be comparatively slow, because the data needs to be fetched out of the GPU’s frame buffer.

In practice, the frame buffer for what’s on the screen right now may not even exist. Once the frame is presented, the CPU reuses that frame buffer to compose the next frame. This means that reading from the frame buffer is even slower: The CPU first has to regenerate the frame buffer, so it can read the desired pixel or pixels from it. This means that a single GetPixel call on the screen DC is no longer just reading four bytes of memory from a frame buffer. Instead, it has to run a full render pass over the screen in order to figure out what’s there, and only then can it fetch four bytes from the frame buffer to read the pixel you’re interested in.

So try not to read from the screen DC. It’ll be really slow because the compositor first has to regenerate the screen contents before it can give you the pixel you want. Look for alternatives. For example, instead of reading the pixel in order to do alpha blending, use the AlphaBlend function to offload the work to the GPU. Or use a layered window with an alpha channel. Both of these change “reading data from the GPU” (which is slow) to “pushing data into the GPU” (which is fast). If you absolutely must read from the screen DC, then do it in bulk with BitBlt. The cost of reading from the screen is in the render pass. If you read 100 pixels one at a time, that’s 100 render passes, each of which produces a single pixel. Much better is to read 100 pixels with a single BitBlt. That way, you pay for only one render pass to get all your pixels.
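
The cost rule above can be sketched with a small simulation. This is not the Windows API: `SimulatedScreen`, `ReadPixel`, `ReadRow`, and `RenderPasses` are hypothetical names invented for illustration, and the only assumption the model makes is the one stated in the text, namely that every read request pays for one full render pass no matter how many pixels it returns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a composited screen. It only models the cost
// rule from the article: each read request forces one full render pass
// before any pixel can be handed back.
class SimulatedScreen {
public:
    SimulatedScreen(int width, int height)
        : width_(width), height_(height) {}

    // Plays the role of GetPixel on the screen DC: one render pass per pixel.
    std::uint32_t ReadPixel(int x, int y) {
        assert(x >= 0 && x < width_ && y >= 0 && y < height_);
        Compose();
        return frame_[Index(x, y)];
    }

    // Plays the role of a single BitBlt from the screen DC: one render pass
    // per call, however many pixels are copied out.
    std::vector<std::uint32_t> ReadRow(int x, int y, int count) {
        assert(x >= 0 && count >= 0 && x + count <= width_);
        assert(y >= 0 && y < height_);
        Compose();
        auto first = frame_.begin() + static_cast<std::ptrdiff_t>(Index(x, y));
        return std::vector<std::uint32_t>(first, first + count);
    }

    // Number of render passes paid for so far.
    int RenderPasses() const { return renderPasses_; }

private:
    std::size_t Index(int x, int y) const {
        return static_cast<std::size_t>(y) * static_cast<std::size_t>(width_) +
               static_cast<std::size_t>(x);
    }

    // Regenerate the entire frame. The pixel value is an arbitrary pattern
    // (row in the high half, column in the low half) so reads can be checked.
    void Compose() {
        frame_.resize(static_cast<std::size_t>(width_) *
                      static_cast<std::size_t>(height_));
        for (int y = 0; y < height_; ++y) {
            for (int x = 0; x < width_; ++x) {
                frame_[Index(x, y)] = (static_cast<std::uint32_t>(y) << 16) |
                                      static_cast<std::uint32_t>(x);
            }
        }
        ++renderPasses_;
    }

    int width_;
    int height_;
    int renderPasses_ = 0;
    std::vector<std::uint32_t> frame_;
};
```

In this model, calling `ReadPixel` 100 times leaves `RenderPasses()` at 100, while a single `ReadRow` covering the same 100 pixels leaves it at 1. The real compositor is far more complicated, but the shape of the cost is the same: the expensive part is the pass, not the number of bytes copied afterward.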

¹ In the really old days, computer graphics were handled by the CPU feeding pixels to the output device in real time, known as “racing the beam”.

² There was a time when double-buffering and triple-buffering had the pejorative names “trouble-buffering” and “cripple-buffering”.

Author

Raymond Chen

Raymond has been involved in the evolution of Windows for more than 30 years. In 2003, he began a Web site known as The Old New Thing which has grown in popularity far beyond his wildest imagination, a development which still gives him the heebie-jeebies. The Web site spawned a book, coincidentally also titled The Old New Thing (Addison Wesley 2007). He occasionally appears on the Windows Dev Docs Twitter account to tell stories which convey no useful information.

5 comments


Alexis Ryan January 2, 2022 · Edited

I made a big mistake in a game engine once, trying to do pixel blending on the CPU with the frame buffer in VRAM. Writing was nice and fast, but reading the pixels back to blend them was incredibly slow. Switching to a frame buffer in RAM instead of VRAM led to the expected fast results. I had originally used a buffer in RAM but switched to VRAM because it made normal rendering slightly faster.
- B Eden January 2, 2022
  
  I did exactly the same with DirectDraw many moons ago…
switchdesktopwithfade@hotmail.com December 24, 2021

The hardest thing I ever deal with is the fact that DirectX surfaces disappear out of nowhere rather than be swapped or paged out. Poof! That buffer you painstakingly built up is just GONE. It takes all the fun out of graphics programming to have to manage that.

The second hardest thing I deal with is that “open source” has become a sloppy industry euphemism for “documentation”. Nobody is reading or writing good books anymore so the deep domain knowledge is lost. Increasingly incompetent people who are dispassionate about patterns and are averse to reading source code are the ones reinventing the wheel again and again. Example: the Swift language, the Windows 11 shell, etc.

Péter Major December 23, 2021

In my experience DesktopDuplication and GraphicsCapture APIs especially are a lot faster. I use these APIs to drive my LED backlight app (see the video on GitHub). Graphics Capture is basically zero overhead, a lot better than Desktop Duplication, especially for games at 4K at 60Hz or above.
- Adam Walling December 29, 2021
  
  GraphicsCapture sounds interesting. I’ve had trouble with the desktop duplication API working consistently.