# Project Austin Part 4 of 6: C++ AMP acceleration Hello, I am Amit Agarwal, a developer on the C++ AMP team. C++ AMP is a new technology available in Visual Studio 2012 that enables C++ developers to make the best use of available heterogeneous computing resources in their applications from within the same C++ sources and the VS IDE they use for programming the CPU. Austin is a digital note-taking app for Windows 8 and the visually engaging 3D effects associated with page turning in the Austin app are powered by the use of C++ AMP.

A page surface is modeled as a 3D mesh comprised of a collection of triangles each defined by the location of its vertices in 3 dimensions. The page turning animation involves a compute-intensive page curling algorithm comprised of two main steps:

1. Deformation of the page surface mesh, used to calculate vertex positions for each frame.
2. Calculating the vertex normals, subsequently used for applying shading to the page surface.

Both these steps are highly data parallel in nature and can be accelerated using C++ AMP to utilize the floating point arithmetic prowess of modern GPUs, hence improving the overall frame rate of the page turning animation. The page deformation step is currently implemented on the CPU; efforts to accelerate this step using C++ AMP are underway and we will talk about it in a future post.

In this blog post we will talk about accelerating the calculation of vertex normals using C++ AMP which is already part of the current version of Austin. But before we dive into the details, here is a picture depicting the page turning animation in Austin which is accelerated using C++ AMP. ## Introduction

Vertex normals are typically calculated as the normalized average of the surface normals of all triangles containing the vertex. Using this approach, computing the vertex normals on the CPU simply involves iterating over all triangles depicting the page surface and accumulating the triangle normals in the normals of the respective vertices.

In pseudo code:

`for each triangle`

`{`

`    …`

`    Position vertex1Pos = triangle.vertex1.position;`

`    Position vertex2Pos = triangle.vertex2.position;`

`    Position vertex3Pos = triangle.vertex3.position;`

` `

`    Normal triangleNormal = cross(vertex2Pos – vertex1Pos, vertex3Pos – vertex1Pos);`

` `

`    triangleNormal.normalize();`

` `

`    vertex1.normal += triangleNormal;`

`    vertex2.normal += triangleNormal;`

`    vertex3.normal += triangleNormal;`

`}`

## Accelerating vertex normals calculation using C++ AMP

As mentioned earlier, calculation of vertex normals is highly amenable to C++ AMP acceleration owing to its data parallel and compute intensive nature.

A simple starting point would be to replace the “for each triangle” loop in the CPU implementation with a C++ AMP parallel_for_each call. The compute domain of the parallel_for_each call is the number of triangles depicting the page, specified as an extent argument. In simple terms this can be thought of as launching as many threads on the accelerator as the number of triangles (typically several thousands) with each thread responsible for computing the surface normal for a triangle and accumulating the value in the normals of the triangle’s vertices. However a vertex is part of multiple triangles and since the parallel_for_each threads execute concurrently, multiple threads can potentially attempt to accumulate their respective triangle normals to the same vertex resulting in a race. One way to address this would be to synchronize the accumulation of each vertex’s normal by using C++ AMP atomic operations. Unfortunately, atomic operations are expensive on GPU accelerators and would be severely detrimental to the kernel’s performance.

A better alternative approach is to break the calculation of vertex normals into two steps:

1. Calculate the normal for each triangle.
2. For each vertex accumulate the normals from all triangles that the vertex is a part of and update the vertex normals after normalizing the accumulated value.

This approach comprises two parallel_for_each invocations. The first one launches as many GPU accelerator threads as there are triangles, with each thread computing the normal of a triangle from the positions of the triangle’s vertices. The triangle normal values are stored in a temporary intermediate concurrency::array_view which is subsequently used in the second stage for accumulating each vertex’s normal.

In pseudo code:

`parallel_for_each(extent<1>(triangleCount), [=](index<1> idx) restrict(amp) `

`{`

`    Position vertex1Pos = triangle.vertex1.position;`

`    Position vertex2Pos = triangle.vertex2.position;`

`    Position vertex3Pos = triangle.vertex3.position;`

` `

`    Normal triangleNormal = cross(vertex2Pos – vertex1Pos, vertex3Pos – vertex1Pos);`

` `

`    triangleNormal.normalize();`

` `

`    tempTriangleNormals[idx] = triangleNormal;`

`});`

The second parallel_for_each launches as many threads as the number of vertices on the page, with each thread accumulating the normals of triangles that the vertex is a part of, from the temporary array_view used to store the triangle normal in the first parallel_for_each. Thereafter, the accumulated vertex normal is normalized and stored in the vertex normal array_view.

In pseudo code:

`parallel_for_each(extent<2>(vertexCountY, vertexCountX), [=](index<2> idx) restrict(amp)`

`{`

`    // First get the existing vertex normal value`

`    Normal vertexNormal = vertexNormalView(idx);`

` `

`    // Each vertex is part of 4 quads with each quad comprising of 2 triangles`

`    // Based on the vertex position, it is determined which triangles the vertex is`

`    // part of and whether that triangle's normal should be accumulated in the vertex normal`

`    if (isVertexOfQuad1Triangle1) {`

`        vertexNormal += tempTriangleNormals(quad1Triangle1_index);`

`    }`

` `

`    if (isVertexOfQuad1Triangle2) {`

`        vertexNormal += tempTriangleNormals(quad1Triangle2_index);`

`    }`

` `

`    if (isVertexOfQuad2Triangle1) {`

`        vertexNormal += tempTriangleNormals(quad2Triangle1_index);`

`    }`

` `

`    if (isVertexOfQuad2Triangle2) {`

`        vertexNormal += tempTriangleNormals(quad2Triangle2_index);`

`    }`

` `

`    ...`

` `

`    vertexNormal.normalize();`

` `

`    vertexNormalView(idx) = vertexNormal;`

`});`

Finally, the normal components of each vertex in the DirectX vertex buffer are updated by reading out the contents of the vertex normal array_view on the CPU. The DirectX vertex buffer is now ready for rendering, with the vertex normal values used for shading the page.

The source code for Austin is freely available for download on CodePlex. The bits specific to C++ AMP acceleration of vertex normals computation are located in the class paper_sheet_node in the source file paper_sheet_node.hpp – the core C++ AMP acceleration code is in the function updateNormalsAmp and some C++ AMP specific initialization code is contained in the function ensureAmpInitialized.

## C++ AMP performance considerations

Having looked at the high-level approach to accelerating the vertex normal computation using C++ AMP, let us now dive deeper into details of the C++ AMP implementation that are important from a performance perspective.

### Struct of Arrays

Firstly, let us talk about the layout of input and output data accessed in the C++ AMP parallel_for_each kernels. The input of the first parallel_for_each invocation is an array_view of vertex positions, each position comprising of three single precision floating point values (x, y, z components). The output is an array_view of triangle normals, where each normal is again comprised of three floating point values (x, y, z components). The input and output of the 2nd parallel_for_each kernel are both array_views of normals.

The position and normal data is stored on the CPU as an array of structs. However, the GPU accelerator memory yields optimal bandwidth if consecutive threads access consecutive memory locations – an access pattern that is commonly referred to as Memory Coalescing in GPU computing parlance. Hence to ensure optimal memory access behavior, the layout of position and normal data on the GPU is adapted to be in the form of three arrays which hold the x, y and z components (of the vertex position or normal) respectively. Note that this is different from the CPU where the data is laid out in memory as an array of structs where each struct comprises of 3 floating point values.

### Persisting data in accelerator memory

The vertex normal values calculated in each frame are used in calculating the vertex normal values for the subsequent frame in the second parallel_for_each kernel. Consequently, it is beneficial to persist the vertex normal data in accelerator memory to be used for the vertex normal calculations in the subsequent frame instead of transferring the data from CPU memory in each frame.

### Using staging arrays for transferring data between CPU and accelerator memory

The vertex position data is transferred from CPU to accelerator memory in each frame. Also, after computing the vertex normals on the GPU accelerator, the vertex normals are transferred back to the vertex buffer in CPU memory to be used for shading. Additionally, as noted earlier, the layout of the vertex position and normal data in accelerator memory is in the form of struct of arrays instead of the array of struct layout in CPU memory. For optimal data transfer performance between the CPU and accelerator memory we employ staging arrays which are used to stage the change in data layout in CPU memory. For example, the vertex positions are copied from the vertex buffer to a staging array in a struct of arrays form and are subsequently copied to the CPU. Similarly, the vertex normal data that is laid out as struct of arrays in GPU memory is copied out to a staging array and is subsequently copied to the vertex buffer on the CPU in an array of struct form.

## Future improvements

A careful look at the two parallel_for_each kernels comprising the vertex normal calculation code using C++ AMP, reveals that both these kernels exhibit 2-D spatial locality of data accesses. For example, in the first parallel_for_each kernel, each thread loads the vertex position data for the vertices of its triangle and since neighboring triangles have common vertices, adjacent threads read the same vertex position data independently from accelerator global memory. Similarly, in the second parallel_for_each kernel, each vertex loads the triangle normal values of the triangles it is part of and since adjacent vertices are part of the same triangles, the same triangle normal values are independently read by adjacent threads from accelerator global memory.

The accelerator global memory has limited bandwidth and note that both C++ AMP kernels here are likely to be memory bound as described in this post on C++ AMP performance guidance. Consequently having multiple threads read the same data multiple times from accelerator global memory is wasteful. While GPU accelerators are designed to hide global memory access latency through heavy multithreading and fast switching between threads, it is generally advisable to optimize global memory accesses (for optimal global memory bandwidth utilization) by employing opportunities of data reuse between adjacent threads through fast on-chip tile_static accelerator memory. While the current implementation does not employ this technique, it is worth experimenting with the use of tile_static memory in this implementation — something we intend to do in the future.

The C++ AMP texture types are another form of global accelerator memory that is typically backed by caches designed for 2-D spatial locality and may be another alternative to using tile_static memory for leveraging the spatial 2-D data locality inherent in the C++ AMP acceleration kernels.

## In closing

In this post we looked at the approach of accelerating one of the compute intensive parts of the Austin application; viz. vertex normal calculations, using C++ AMP. While the actual gains obtained from C++ AMP acceleration depend on the available GPU hardware, if appropriately employed may yield orders of magnitude of improvement over CPU performance for compute intensive kernels in your applications. Also, in absence of DirectX 11 capable GPU hardware, C++ AMP employs a CPU fallback which uses your CPU’s multiple cores and SSE capabilities to accelerate the execution of your kernels. You can learn more about C++ AMP on the MSDN blog for Parallel Programming in Native Code.

We would love to hear your thoughts, comments, questions and feedback below or on the Parallel Programming in Native Code MSDN forum.

Posted in C++