{"id":2113,"date":"2012-10-11T11:00:00","date_gmt":"2012-10-11T11:00:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/vcblog\/2012\/10\/11\/project-austin-part-4-of-6-c-amp-acceleration\/"},"modified":"2021-10-01T14:17:04","modified_gmt":"2021-10-01T14:17:04","slug":"project-austin-part-4-of-6-c-amp-acceleration","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/cppblog\/project-austin-part-4-of-6-c-amp-acceleration\/","title":{"rendered":"Project Austin Part 4 of 6: C++ AMP acceleration"},"content":{"rendered":"<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Hello, I am Amit Agarwal, a developer on the C++ AMP team. <a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2011\/09\/13\/c-amp-in-a-nutshell.aspx\"><span style=\"color: #0000ff\">C++ AMP<\/span><\/a> is a new technology available in Visual Studio 2012 that enables C++ developers to make the best use of available heterogeneous computing resources in their applications from within the same C++ sources and the VS IDE they use for programming the CPU. <\/span><\/span><a href=\"http:\/\/blogs.msdn.com\/b\/vcblog\/archive\/2012\/09\/11\/10348466.aspx\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">Austin<\/span><\/a><span style=\"font-family: Calibri;font-size: small\"> is a digital note-taking app for Windows 8 and the visually engaging 3D effects associated with <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/vcblog\/archive\/2012\/09\/12\/10348494.aspx?wa=wsignin1.0\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">page turning in the Austin app<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> are powered by the use of C++ AMP. <\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">A page surface is modeled as a 3D mesh comprised of a collection of triangles each defined by the location of its vertices in 3 dimensions. 
The page turning animation involves a compute-intensive page curling algorithm comprised of two main steps:<\/span><\/span><\/p>\n<ol>\n<li><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Deformation of the page surface mesh, used to calculate vertex positions for each frame.<\/span><\/span><\/li>\n<li><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Calculating the vertex normals, subsequently used for applying shading to the page surface.<\/span><\/span><\/li>\n<\/ol>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Both these steps are highly data parallel in nature and can be accelerated using C++ AMP to utilize the floating point arithmetic prowess of modern GPUs, hence improving the overall frame rate of the page turning animation. The page deformation step is currently implemented on the CPU; efforts to accelerate this step using C++ AMP are underway and we will talk about it in a future post.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">In this blog post we will talk about accelerating the calculation of vertex normals using C++ AMP which is already part of the current version of Austin. 
But before we dive into the details, here is a picture depicting the page turning animation in Austin which is accelerated using C++ AMP.<\/span><\/span><\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2012\/10\/7317.image_thumb_7A41CE48.png\"><img decoding=\"async\" src=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2012\/10\/7317.image_thumb_7A41CE48.png\" alt=\"Image 7317 image thumb 7A41CE48\" width=\"244\" height=\"236\" class=\"aligncenter size-full wp-image-28961\" srcset=\"https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2012\/10\/7317.image_thumb_7A41CE48.png 244w, https:\/\/devblogs.microsoft.com\/cppblog\/wp-content\/uploads\/sites\/9\/2012\/10\/7317.image_thumb_7A41CE48-24x24.png 24w\" sizes=\"(max-width: 244px) 100vw, 244px\" \/><\/a><\/p>\n<h2>Introduction<\/h2>\n<p><a href=\"http:\/\/en.wikipedia.org\/wiki\/Vertex_normal\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">Vertex normals<\/span><\/a><span style=\"font-family: Calibri;font-size: small\"> are typically calculated as the <\/span><a href=\"http:\/\/en.wikipedia.org\/wiki\/Unit_vector\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">normalized<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> average of the surface normals of all triangles containing the vertex. 
Using this approach, computing the vertex normals on the CPU simply involves iterating over all triangles depicting the page surface and accumulating the triangle normals in the normals of the respective vertices.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">In pseudo code:<\/span><\/span><\/p>\n<div id=\"codeSnippetWrapper\">\n<div id=\"codeSnippet\" style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\"><span style=\"color: #0000ff\">for<\/span> each triangle<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">{<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    &hellip;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    Position vertex1Pos = triangle.vertex1.position;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: 
black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    Position vertex2Pos = triangle.vertex2.position;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    Position vertex3Pos = triangle.vertex3.position;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    Normal triangleNormal = cross(vertex2Pos &ndash; vertex1Pos, vertex3Pos &ndash; vertex1Pos);<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    triangleNormal.normalize();<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: 
left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    triangle.vertex1.normal += triangleNormal;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    triangle.vertex2.normal += triangleNormal;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    triangle.vertex3.normal += triangleNormal;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">}<\/pre>\n<p><!--CRLF--><\/div>\n<\/div>\n<h2>Accelerating vertex normals calculation using C++ AMP<\/h2>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">As mentioned earlier, calculation of vertex normals is highly amenable to C++ AMP acceleration owing to its data parallel and compute-intensive nature. <\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">A simple starting point would be to replace the &ldquo;<em>for each triangle<\/em>&rdquo; loop in the CPU implementation with a C++ AMP <\/span><a href=\"http:\/\/www.danielmoth.com\/Blog\/parallelforeach-From-Amph-Part-1.aspx\"><em><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">parallel_for_each<\/span><\/em><\/a><span style=\"font-family: Calibri;font-size: small\"> call. 
The compute domain of the <em>parallel_for_each<\/em> call is the number of triangles depicting the page, specified as an <\/span><a href=\"http:\/\/www.danielmoth.com\/Blog\/concurrencyextent-From-Amph.aspx\"><em><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">extent<\/span><\/em><\/a><span style=\"font-family: Calibri;font-size: small\"> argument. In simple terms, this can be thought of as launching as many threads on the <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/02\/02\/default-accelerator-in-c-amp.aspx\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">accelerator<\/span><\/a><span style=\"font-family: Calibri;font-size: small\"> as there are triangles (typically several thousand), with each thread responsible for computing the surface normal for a triangle and accumulating the value in the normals of the triangle&rsquo;s vertices. However, a vertex is part of multiple triangles, and since the <em>parallel_for_each<\/em> threads execute concurrently, multiple threads can potentially attempt to accumulate their respective triangle normals into the same vertex, resulting in a race. One way to address this would be to synchronize the accumulation of each vertex&rsquo;s normal by using <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/01\/04\/c-amp-s-atomic-operations.aspx\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">C++ AMP atomic operations<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\">. 
Unfortunately, atomic operations are expensive on GPU accelerators and would be severely detrimental to the kernel&rsquo;s performance.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">A better alternative approach is to break the calculation of vertex normals into two steps:<\/span><\/span><\/p>\n<ol>\n<li><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Calculate the normal for each triangle.<\/span><\/span><\/li>\n<li><span style=\"font-size: small\"><span style=\"font-family: Calibri\">For each vertex accumulate the normals from all triangles that the vertex is a part of and update the vertex normals after normalizing the accumulated value.<\/span><\/span><\/li>\n<\/ol>\n<p><span style=\"font-family: Calibri;font-size: small\">This approach comprises two <em>parallel_for_each<\/em> invocations. The first one launches as many GPU accelerator threads as there are triangles, with each thread computing the normal of a triangle from the positions of the triangle&rsquo;s vertices. 
The triangle normal values are stored in a temporary intermediate <\/span><a href=\"http:\/\/www.danielmoth.com\/Blog\/array-And-Arrayview-From-Amph.aspx\"><em><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">concurrency::array_view<\/span><\/em><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> which is subsequently used in the second stage for accumulating each vertex&rsquo;s normal.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">In pseudo code:<\/span><\/span><\/p>\n<div id=\"codeSnippetWrapper\">\n<div id=\"codeSnippet\" style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">parallel_for_each(extent&lt;1&gt;(triangleCount), [=](index&lt;1&gt; idx) restrict(amp) <\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">{<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    Position vertex1Pos = triangle.vertex1.position;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: 
visible;border-style: none;padding: 0px\">    Position vertex2Pos = triangle.vertex2.position;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    Position vertex3Pos = triangle.vertex3.position;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    Normal triangleNormal = cross(vertex2Pos &ndash; vertex1Pos, vertex3Pos &ndash; vertex1Pos);<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    triangleNormal.normalize();<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 
12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    tempTriangleNormals[idx] = triangleNormal;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">});<\/pre>\n<p><!--CRLF--><\/div>\n<\/div>\n<p><span style=\"font-family: Calibri;font-size: small\">The second <em>parallel_for_each <\/em>launches as many threads as the number of vertices on the page, with each thread accumulating the normals of triangles that the vertex is a part of, from the temporary <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/07\/23\/concurrency-array-view-introduction.aspx\"><em><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">array_view<\/span><\/em><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> used to store the triangle normal in the first<em> parallel_for_each<\/em>. 
Thereafter, the accumulated vertex normal is normalized and stored in the vertex normal <em>array_view<\/em>.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">In pseudo code:<\/span><\/span><\/p>\n<div id=\"codeSnippetWrapper\">\n<div id=\"codeSnippet\" style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">parallel_for_each(extent&lt;2&gt;(vertexCountY, vertexCountX), [=](index&lt;2&gt; idx) restrict(amp)<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">{<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    <span style=\"color: #008000\">\/\/ First get the existing vertex normal value<\/span><\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    Normal vertexNormal = vertexNormalView(idx);<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: 
ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    <span style=\"color: #008000\">\/\/ Each vertex is part of 4 quads, with each quad comprising 2 triangles<\/span><\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    <span style=\"color: #008000\">\/\/ Based on the vertex position, it is determined which triangles the vertex is<\/span><\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    <span style=\"color: #008000\">\/\/ part of and whether that triangle's normal should be accumulated in the vertex normal<\/span><\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    <span style=\"color: #0000ff\">if<\/span> (isVertexOfQuad1Triangle1) {<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">        vertexNormal += tempTriangleNormals(quad1Triangle1_index);<\/pre>\n<p><!--CRLF--><\/p>\n<pre 
style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    }<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    <span style=\"color: #0000ff\">if<\/span> (isVertexOfQuad1Triangle2) {<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">        vertexNormal += tempTriangleNormals(quad1Triangle2_index);<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    }<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: 
none;padding: 0px\">    <span style=\"color: #0000ff\">if<\/span> (isVertexOfQuad2Triangle1) {<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">        vertexNormal += tempTriangleNormals(quad2Triangle1_index);<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    }<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    <span style=\"color: #0000ff\">if<\/span> (isVertexOfQuad2Triangle2) {<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">        vertexNormal += tempTriangleNormals(quad2Triangle2_index);<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    }<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: 
left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    ...<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    vertexNormal.normalize();<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">&nbsp;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: white;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">    vertexNormalView(idx) = vertexNormal;<\/pre>\n<p><!--CRLF--><\/p>\n<pre style=\"text-align: left;line-height: 12pt;background-color: #f4f4f4;margin: 0em;width: 100%;font-family: 'Courier New', courier, monospace;direction: ltr;color: black;font-size: 8pt;overflow: visible;border-style: none;padding: 0px\">});<\/pre>\n<p><!--CRLF--><\/div>\n<\/div>\n<p><span style=\"font-size: 
small\"><span style=\"font-family: Calibri\">Finally, the normal components of each vertex in the DirectX vertex buffer are updated by reading out the contents of the vertex normal <em>array_view<\/em> on the CPU. The DirectX vertex buffer is now ready for rendering, with the vertex normal values used for shading the page.<\/span><\/span><\/p>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">The source code for Austin is freely <a href=\"http:\/\/austin.codeplex.com\">available for download&nbsp;on CodePlex<\/a>. The bits specific to C++ AMP acceleration of vertex normals computation are located in the class <em>paper_sheet_node<\/em> in the source file <em>paper_sheet_node.hpp &ndash; <\/em>the core C++ AMP acceleration code is in the function <em>updateNormalsAmp<\/em> and some C++ AMP specific initialization code is contained in the function <em>ensureAmpInitialized.<\/em><\/span><\/span><\/p>\n<h2>C++ AMP performance considerations<\/h2>\n<p><span style=\"font-family: Calibri\"><span style=\"font-size: small\">Having looked at the high-level approach to accelerating the vertex normal computation using C++ AMP, let us now dive deeper into details of the C++ AMP implementation that are important from a performance perspective.<\/span><\/span><\/p>\n<h3><span style=\"color: #4f81bd\"><span style=\"font-family: Cambria\"><span style=\"font-size: small\">Struct of Arrays <\/span><\/span><\/span><\/h3>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">Firstly, let us talk about the layout of input and output data accessed in the C++ AMP <em>parallel_for_each <\/em>kernels. The input of the first <em>parallel_for_each<\/em> invocation is an <em>array_view<\/em> of vertex positions, each position comprising three single precision floating point values (x, y, z components). The output is an array_view of triangle normals, where each normal again comprises three floating point values (x, y, z components). 
The input and output of the 2<sup>nd<\/sup> parallel_for_each kernel are both array_views of normals. <\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">The position and normal data is stored on the CPU as an array of structs. However, the GPU accelerator memory yields optimal bandwidth if consecutive threads access consecutive memory locations &ndash; an access pattern that is commonly referred to as <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/08\/10\/memory-coalescing-with-c-amp.aspx\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">Memory Coalescing<\/span><\/a><span style=\"font-family: Calibri;font-size: small\"> in GPU computing parlance. Hence, to ensure optimal memory access behavior, the layout of position and normal data on the GPU is adapted to be in the form of three arrays which hold the x, y and z components (of the vertex position or normal) respectively. Note that <\/span><span style=\"font-family: Calibri;font-size: small\">this is different<\/span><span style=\"font-family: Calibri\">&nbsp;<\/span><span style=\"font-size: small\"><span style=\"font-family: Calibri\">from the CPU, where the data is laid out in memory as an array of structs in which each struct comprises 3 floating point values. <\/span><\/span><\/p>\n<h3><span style=\"color: #4f81bd\"><span style=\"font-family: Cambria\"><span style=\"font-size: small\">Persisting data in accelerator memory<\/span><\/span><\/span><\/h3>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">The vertex normal values calculated in each frame are used in calculating the vertex normal values for the subsequent frame in the second <em>parallel_for_each<\/em> kernel. 
Consequently, it is beneficial to persist the vertex normal data in accelerator memory to be used for the vertex normal calculations in the subsequent frame instead of transferring the data from CPU memory in each frame.<\/span><\/span><\/p>\n<h3><span style=\"color: #4f81bd\"><span style=\"font-family: Cambria\"><span style=\"font-size: small\">Using staging arrays for transferring data between CPU and accelerator memory<\/span><\/span><\/span><\/h3>\n<p><span style=\"font-family: Calibri;font-size: small\">The vertex position data is transferred from CPU to accelerator memory in each frame. Also, after computing the vertex normals on the GPU accelerator, the vertex normals are transferred back to the vertex buffer in CPU memory to be used for shading. Additionally, as noted earlier, the layout of the vertex position and normal data in accelerator memory is in the form of struct of arrays instead of the array of struct layout in CPU memory. For optimal data transfer performance between the CPU and accelerator memory we employ <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2011\/11\/10\/staging-arrays-in-c-amp.aspx\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">staging arrays<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> which are used to stage the change in data layout in CPU memory. For example, the vertex positions are copied from the vertex buffer to a staging array in a struct of arrays form and are subsequently copied to the accelerator. 
Similarly, the vertex normal data that is laid out as struct of arrays in GPU memory is copied out to a staging array and is subsequently copied to the vertex buffer on the CPU in an array of struct form.<\/span><\/span><\/p>\n<h2>Future improvements<\/h2>\n<p><span style=\"font-size: small\"><span style=\"font-family: Calibri\">A careful look at the two <em>parallel_for_each<\/em> kernels comprising the vertex normal calculation code using C++ AMP reveals that both these kernels exhibit 2-D spatial locality of data accesses. For example, in the first <em>parallel_for_each<\/em> kernel, each thread loads the vertex position data for the vertices of its triangle and since neighboring triangles have common vertices, adjacent threads read the same vertex position data independently from accelerator global memory. Similarly, in the second <em>parallel_for_each<\/em> kernel, each thread loads the triangle normal values of the triangles its vertex is part of and since adjacent vertices are part of the same triangles, the same triangle normal values are independently read by adjacent threads from accelerator global memory. <\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">The accelerator global memory has limited bandwidth, and both C++ AMP kernels here are likely to be memory bound as described in this post on <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/07\/26\/performance-guidance-for-c-amp.aspx\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">C++ AMP performance guidance<\/span><\/a><span style=\"font-family: Calibri\"><span style=\"font-size: small\">.<\/span> <span style=\"font-size: small\">Consequently, having multiple threads read the same data multiple times from accelerator global memory is wasteful. 
While GPU accelerators are designed to hide global memory access latency through heavy multithreading and fast switching between threads, it is generally advisable to optimize global memory accesses (for optimal global memory bandwidth utilization) by exploiting opportunities for data reuse between adjacent threads through fast on-chip <\/span><\/span><a href=\"http:\/\/www.danielmoth.com\/Blog\/tilestatic-Tilebarrier-And-Tiled-Matrix-Multiplication-With-C-AMP.aspx\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">tile_static<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> accelerator memory. While the current implementation does not employ this technique, it is worth experimenting with the use of <em>tile_static<\/em> memory in this implementation, something we intend to do in the future.<\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">The <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/04\/04\/getting-started-with-textures-in-c-amp.aspx\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">C++ AMP texture<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\"> types are another form of global accelerator memory that is typically backed by caches designed for 2-D spatial locality and may be an alternative to using <em>tile_static<\/em> memory for leveraging the spatial 2-D data locality inherent in the C++ AMP acceleration kernels.<\/span><\/span><\/p>\n<h2>In closing<\/h2>\n<p><span style=\"font-family: Calibri;font-size: small\">In this post we looked at the approach of accelerating one of the compute-intensive parts of the Austin application, namely the vertex normal calculations, using C++ AMP. 
While the actual gains obtained from C++ AMP acceleration depend on the available GPU hardware, C++ AMP, if appropriately employed, may yield orders-of-magnitude improvement over CPU performance for compute-intensive kernels in <\/span><span style=\"font-family: Calibri;font-size: small\">your applications<\/span><span style=\"font-family: Calibri;font-size: small\">. Also, in the absence of DirectX 11 capable GPU hardware, C++ AMP employs a <\/span><a href=\"http:\/\/www.danielmoth.com\/Blog\/Running-C-AMP-Kernels-On-The-CPU.aspx\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">CPU fallback<\/span><\/a><span style=\"font-family: Calibri;font-size: small\"> which uses your CPU&rsquo;s multiple cores and SSE capabilities to accelerate the execution of your kernels. You can learn more about C++ AMP on the MSDN blog for <\/span><a href=\"http:\/\/blogs.msdn.com\/b\/nativeconcurrency\/archive\/2012\/04\/04\/getting-started-with-textures-in-c-amp.aspx\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">Parallel Programming in Native Code<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\">.<\/span><\/span><\/p>\n<p><span style=\"font-family: Calibri;font-size: small\">We would love to hear your thoughts, comments, questions and feedback below or on the <\/span><a href=\"http:\/\/social.msdn.microsoft.com\/Forums\/en-US\/parallelcppnative\/threads\"><span style=\"color: #0000ff;font-family: Calibri;font-size: small\">Parallel Programming in Native Code MSDN forum<\/span><\/a><span style=\"font-size: small\"><span style=\"font-family: Calibri\">.<\/span><\/span><span style=\"font-family: Calibri;font-size: x-small\">&nbsp;<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello, I am Amit Agarwal, a developer on the C++ AMP team. 
C++ AMP is a new technology available in Visual Studio 2012 that enables C++ developers to make the best use of available heterogeneous computing resources in their applications from within the same C++ sources and the VS IDE they use for programming the [&hellip;]<\/p>\n","protected":false},"author":286,"featured_media":35994,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1],"tags":[158,100,101,80],"class_list":["post-2113","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cplusplus","tag-austin","tag-c-language","tag-gpu","tag-parallelism"],"acf":[],"blog_post_summary":"<p>Hello, I am Amit Agarwal, a developer on the C++ AMP team. C++ AMP is a new technology available in Visual Studio 2012 that enables C++ developers to make the best use of available heterogeneous computing resources in their applications from within the same C++ sources and the VS IDE they use for programming the 
[&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/2113","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/users\/286"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/comments?post=2113"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/posts\/2113\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media\/35994"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/media?parent=2113"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/categories?post=2113"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/cppblog\/wp-json\/wp\/v2\/tags?post=2113"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}