This blog post describes the C++ AMP remappable shader feature and the changes that it brings to the compilation/execution model in Visual Studio 2014. This feature improves C++ AMP code compilation speed without affecting runtime performance. We will provide data to show the improvements and cover the steps needed to use this feature.
To understand the advantage of remappable shaders, I will start with the technology they replaced. Previously, for the majority of parallel_for_each calls in C++ AMP programs, the Visual C++ compiler generated two DirectX shaders per call. These were eventually turned into device code through the DirectX layer on which Microsoft's implementation of C++ AMP is built. Producing two shaders per parallel_for_each was a tradeoff between performance and program correctness, made necessary by potential resource aliasing (e.g., different array_view objects referring to overlapping memory locations). For example, in the following code snippet, the compiler cannot prove that a0, a1, a2 and a3 refer to non-overlapping data, since that information is only available at runtime.
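#include <amp.h>
using namespace concurrency;

// Whether a0..a3 overlap is known only at runtime, not at compile time.
template <typename T>
void foo(array_view<T>& a0, array_view<T>& a1, array_view<T>& a2, array_view<T>& a3)
{
    // array_view objects are captured by value; a restrict(amp) lambda
    // cannot capture them by reference.
    parallel_for_each(a0.extent, [=] (index<1> const idx) restrict(amp)
    {
        a0[idx] = 10;
        a1[idx] = 15;
        a2[idx] = a0[idx];
        a3[idx] = a1[idx];
        a0[idx] += a1[idx];
        a1[idx] -= a2[idx];
        a2[idx] *= a3[idx];
        if (a0[idx]) { a3[idx] /= a0[idx]; }
    });
}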
As such, the compiler had to assume the worst aliasing pattern during code generation to guarantee program correctness, resulting in what we called the aliased shader. On the other hand, performance is a critical factor, which is why we also generated a non-aliased shader that, as its name suggests, assumed no aliasing among captured resources and had better performance characteristics. The C++ AMP runtime picked the appropriate one depending on the aliasing pattern of each specific parallel_for_each invocation.
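To make that runtime choice concrete, here is a minimal sketch in plain C++. It is not the C++ AMP runtime's actual code: `BufferRange`, `overlaps`, `ShaderKind` and `pick_shader` are hypothetical names, and the sketch assumes each captured resource can be described as a simple byte range.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one captured resource as a byte range.
struct BufferRange {
    std::uintptr_t begin;  // address of the first byte
    std::size_t    size;   // length in bytes
};

// True when the half-open ranges [begin, begin + size) share at least one byte.
inline bool overlaps(const BufferRange& x, const BufferRange& y) {
    if (x.size == 0 || y.size == 0) return false;  // empty ranges never alias
    return x.begin < y.begin + y.size && y.begin < x.begin + x.size;
}

enum class ShaderKind { NonAliased, Aliased };

// Assumed selection rule: use the fast non-aliased shader only when no two
// resources overlap; otherwise fall back to the conservative aliased shader.
inline ShaderKind pick_shader(const std::vector<BufferRange>& resources) {
    for (std::size_t i = 0; i < resources.size(); ++i) {
        for (std::size_t j = i + 1; j < resources.size(); ++j) {
            if (overlaps(resources[i], resources[j])) {
                return ShaderKind::Aliased;
            }
        }
    }
    return ShaderKind::NonAliased;
}
```

The pairwise check is quadratic in the number of captured resources, which is acceptable here because a kernel typically captures only a handful of them.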
With the remappable shader feature, we generate only the non-aliased version during compilation. The runtime is now responsible for handling the different resource aliasing patterns correctly. To do so, it performs a second phase of shader compilation that produces the best code for each specific aliasing pattern. The final shader code is also cached by the runtime, so that the next invocation with the same pattern incurs no further compilation. Our measurements showed that the additional runtime compilation has a negligible performance hit, while generating one less shader cuts the shader compilation time by half.
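The caching behavior can be pictured with a small sketch in plain C++. This only illustrates the idea of caching per aliasing pattern; `AliasPattern`, `CompiledShader` and `ShaderCache` are hypothetical names, and the way a pattern is encoded here is an assumption, not how the C++ AMP runtime represents it.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical pattern key: entry i holds the lowest index of a resource that
// resource i aliases (its own index when it aliases nothing earlier).
using AliasPattern = std::vector<std::size_t>;

// Stand-in for the final device code produced by the second compilation phase.
struct CompiledShader {
    std::string description;
};

class ShaderCache {
public:
    // Returns the shader for this pattern, compiling it only on first use.
    const CompiledShader& get(const AliasPattern& pattern) {
        auto it = cache_.find(pattern);
        if (it == cache_.end()) {
            ++compile_count_;  // second-phase compilation happens here
            it = cache_.emplace(pattern, specialize(pattern)).first;
        }
        return it->second;
    }

    std::size_t compile_count() const { return compile_count_; }

private:
    // Placeholder for specializing the non-aliased shader to one pattern.
    static CompiledShader specialize(const AliasPattern& pattern) {
        std::string text = "remap:";
        for (std::size_t i = 0; i < pattern.size(); ++i) {
            text += " r" + std::to_string(i) + "->r" + std::to_string(pattern[i]);
        }
        return CompiledShader{text};
    }

    std::map<AliasPattern, CompiledShader> cache_;
    std::size_t compile_count_ = 0;
};
```

In this sketch, calling `get` twice with the same pattern compiles once and then returns the cached entry, which mirrors why repeated invocations with the same aliasing pattern pay no further compilation cost.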
Exactly how this translates into a visible compilation speedup depends on the complexity of the parallel_for_each kernel (including its entire call graph). For the samples we tested, we observed speedups ranging from 8% to 28%, as summarized below.
| Sample | Compilation Speedup |
|---|---|
| Cartoonizer | 8% |
| | 14% |
| Sequence Alignment | 28% |
The compilation speedup represents the end-to-end user experience in these examples. To enjoy the benefits of remappable shaders, you need to compile your code with a Visual C++ compiler that implements this feature. Because of the runtime shader compilation, C++ AMP takes a dependency on D3DCompiler_47.dll, which is present as a system component on Windows 8.1 and higher. For down-level operating systems, C++ AMP developers are required to ship D3DCompiler_47.dll. Please refer to the DirectX SDK for further instructions.