{"id":313,"date":"2014-09-08T14:18:09","date_gmt":"2014-09-08T14:18:09","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/vcblog\/2014\/09\/08\/on-c-amp-remappable-shader\/"},"modified":"2019-02-18T18:05:14","modified_gmt":"2019-02-18T18:05:14","slug":"on-c-amp-remappable-shader","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/on-c-amp-remappable-shader\/","title":{"rendered":"On C++ AMP Remappable Shader"},"content":{"rendered":"<p><span style=\"font-family: helvetica;font-size: small\">This blog post describes the C++ AMP remappable shader feature and the changes that it brings to the compilation\/execution model in Visual Studio 2014. This feature improves C++ AMP code compilation speed without affecting runtime performance. We will provide data to show the improvements and cover the steps to use this feature.<\/span><\/p>\n<p><span style=\"font-family: helvetica;font-size: small\">To understand the advantage of the remappable shader, I will start with the technology that it replaced. Previously, for the majority of parallel_for_each calls in C++ AMP programs, the Visual C++ compiler generated two DirectX shaders per call, which were eventually turned into device code by the DirectX layer on which Microsoft&rsquo;s implementation of C++ AMP is built. Producing two shaders per parallel_for_each was a tradeoff between performance and program correctness in the face of potential resource aliasing (e.g., different array_view objects referring to overlapping memory locations). 
For example, in the following code snippet, the compiler cannot prove that a0, a1, a2 and a3 refer to non-overlapping data, since that information is available only at runtime.<\/span><\/p>\n<div>\n<div>\n<div>\n<p><span style=\"font-family: courier new,courier\">void foo (array_view&lt;T&gt;&amp; a0, array_view&lt;T&gt;&amp; a1, array_view&lt;T&gt;&amp; a2, array_view&lt;T&gt;&amp; a3)<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">{<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp; parallel_for_each(a0.extent, [=] (index&lt;1&gt; const idx) restrict(amp)<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp; {<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a0[idx] = 10;<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a1[idx] = 15;<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a2[idx] = a0[idx];<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a3[idx] = a1[idx];<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a0[idx] += a1[idx];<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a1[idx] -= a2[idx];<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a2[idx] *= a3[idx];<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (a0[idx]) { a3[idx] \/= a0[idx]; }<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">&nbsp;&nbsp;&nbsp; });<\/span><\/p>\n<p><span style=\"font-family: courier new,courier\">}<\/span><\/p>\n<p><span style=\"font-family: helvetica;font-size: small\"><span style=\"line-height: 
107%\">As such, the compiler has to assume the worst-case aliasing pattern during code generation to guarantee program correctness, resulting in what we called the aliased shader. On the other hand, performance is a critical factor, which is why we also generated a non-aliased shader that, as its name suggests, assumed no <\/span>aliasing existed among captured resources and had better performance characteristics. The C++ AMP runtime picked the appropriate one based on the aliasing pattern of each specific parallel_for_each invocation.<\/span><\/p>\n<p><span style=\"font-family: helvetica;font-size: small\">With the remappable shader feature, we generate only the non-aliased version during compilation. The runtime is now responsible for handling the different resource aliasing patterns correctly. To do so, it includes a second shader compilation phase that produces the best code for each specific aliasing pattern. The final shader code is also cached by the runtime, so a subsequent invocation with the same pattern incurs no further compilation. Our measurements showed that the additional runtime compilation has a negligible performance cost, while generating one fewer shader cuts the shader compilation time in half.<\/span><\/p>\n<p><span style=\"font-family: helvetica;font-size: small\">Exactly how this translates into visible compilation speedup depends on the complexity of the parallel_for_each kernel (including its entire call graph). 
For the samples we tested, we observed speedups ranging from 8% to 28%, as summarized below.<\/span><\/p>\n<div align=\"center\">\n<table style=\"width: 352px\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td nowrap width=\"172\">\n<p>&nbsp;<\/p>\n<\/td>\n<td nowrap width=\"180\">\n<p align=\"center\">Compilation Speedup<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"172\">\n<p align=\"right\">Cartoonizer<\/p>\n<\/td>\n<td nowrap width=\"180\">\n<p align=\"center\">8%<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"172\">\n<p align=\"right\"><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/08\/16\/fluid-simulation-c-amp-sample.aspx\">Fluid Simulation<\/a><\/p>\n<\/td>\n<td nowrap width=\"180\">\n<p align=\"center\">14%<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td width=\"172\">\n<p align=\"right\">Sequence Alignment<\/p>\n<\/td>\n<td nowrap width=\"180\">\n<p align=\"center\">28%<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p><span style=\"font-family: helvetica;font-size: small\">These compilation speedups reflect the end-to-end user experience for these examples. To benefit from the remappable shader, you need to compile your code with a Visual C++ compiler that implements this feature. Because of the runtime shader compilation, C++ AMP takes a dependency on D3DCompiler_47.dll, which is present as a system component on Windows 8.1 and higher. On down-level operating systems, C++ AMP developers are required to ship D3DCompiler_47.dll. Please refer to <a href=\"http:\/\/msdn.microsoft.com\/library\/windows\/desktop\/ee663275.aspx\">DirectX SDK<\/a> for further instructions.<\/span><\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>This blog post describes the C++ AMP remappable shader feature and the changes that it brings to the compilation\/execution model in Visual Studio 2014. 
This feature improves C++ AMP code compilation speed without affecting runtime performance. We will provide data to show the improvements and cover the steps to use this feature. To understand the advantage [&hellip;]<\/p>\n","protected":false},"author":301,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[115],"class_list":["post-313","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cplusplus","tag-c-amp"],"acf":[],"blog_post_summary":"<p>This blog post describes the C++ AMP remappable shader feature and the changes that it brings to the compilation\/execution model in Visual Studio 2014. This feature improves C++ AMP code compilation speed without affecting runtime performance. We will provide data to show the improvements and cover the steps to use this feature. To understand the advantage [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/313","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/301"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=313"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/313\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=313"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.
com\/cppblog\/wp-json\/wp\/v2\/categories?post=313"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=313"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}