Coming to DirectX 12— Mesh Shaders and Amplification Shaders: Reinventing the Geometry Pipeline  


D3D12 is adding two new shader stages: the Mesh Shader and the Amplification Shader. These additions will streamline the rendering pipeline while boosting flexibility and efficiency. In this new pre-rasterization pipeline, Mesh and Amplification Shaders can optionally replace the section of the pipeline consisting of the Input Assembler and the Vertex, Geometry, Domain, and Hull Shaders with richer and more general-purpose capabilities. This is possible through a reimagination of how geometry is processed.


What does the geometry pipeline look like now?

How can we fix it?

How do Mesh Shaders work?

What does an Amplification Shader do?

What exactly is a meshlet?

Now that I’m sold, how do I build a Mesh Shader?

How to build an Amplification Shader

Calling shaders in the runtime

Getting Started

What does the geometry pipeline look like now?  

In current pipelines, geometry is processed whole. This means that for a mesh with hundreds of millions of triangles, all the values in the index buffer need to be processed in order, and all the vertices of a triangle must be processed before culling can even occur. Although not all geometry is that dense, we live in a world of increasing complexity, where users want more detail without sacrificing speed. This means that a pipeline with a linear bottleneck like the index buffer is unsustainable.

Additionally, the process is rigid. Because of the use of the index buffer, all index data must be 16 or 32 bits in size, and a single index value applies to all the vertex attributes at once. Options for compressing geometry data are limited. Culling can be performed by software at the level of an entire draw call, or by hardware on a per-primitive basis only after all the vertices of a primitive have been shaded, but there are no in-between options. These are all requirements that can limit how much a developer is able to do. For example, what if you want to store separate bounding boxes for pieces of a larger mesh, then frustum cull each piece individually, or split up a mesh into groups of triangles that share similar normals, so an entire backfacing triangle group can be rejected up-front by a single test? How about moving per-triangle backface tests as early as possible in the geometry pipeline, which could allow skipping the cost of fetching vertex attributes for rejected triangles? Or implementing conservative animation-aware bounding box culling for small chunks of a mesh, which could run before the expensive skinning computations? With mesh shaders, these choices are entirely under your control.
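To make the "single test for a whole triangle group" idea concrete, here is a minimal CPU-side sketch in C++. It is an illustration under stated assumptions, not code from the spec: the `Vec3` and `NormalCone` types and the `IsGroupBackfacing` function are hypothetical names, and the test uses one view direction for the whole group, which ignores perspective.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical per-group data: a unit axis plus the half-angle (in radians)
// of a cone that contains every triangle normal in the group.
struct NormalCone {
    Vec3  axis;
    float halfAngle;
};

// Returns true when every triangle in the group faces away from the viewer.
// toViewer is a unit vector pointing from the geometry toward the camera.
// Assumption: one direction is used for the whole group (no perspective).
inline bool IsGroupBackfacing(const NormalCone& cone, const Vec3& toViewer) {
    // A cone spanning 90 degrees or more can never be rejected as a whole.
    if (cone.halfAngle >= 1.5707963f) {
        return false;
    }
    // A triangle with normal n is backfacing when dot(n, toViewer) <= 0.
    // Every n lies within halfAngle of the axis, so all of them are
    // backfacing once the axis is at least (90 degrees + halfAngle) away
    // from toViewer, i.e. dot(axis, toViewer) <= cos(90 + halfAngle).
    return Dot(cone.axis, toViewer) <= -std::sin(cone.halfAngle);
}
```

With a test like this, a group whose normals all point away from the camera can be skipped before any of its vertices are fetched or shaded.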

How can we fix this? 

In fact, we’re not going to try. Mesh Shaders are not putting a band-aid onto a system that’s struggling to keep up. Instead, they are reinventing the pipeline. By using a compute programming model, the Mesh Shader can process chunks of the mesh, which we call “meshlets”, in parallel. The threads that process each meshlet can work together using groupshared memory to read whatever format of input data they choose in whatever way they like, process the geometry, then output a small indexed primitive list. This means no more linear iterating through the entire mesh, and no limits imposed by the more rigid structure of previous shader stages.  

How do Mesh Shaders work?  

A Mesh Shader begins its work by dispatching a set of threadgroups, each of which processes a subset of the larger mesh. Each threadgroup has access to groupshared memory like compute shaders, but outputs vertices and primitives that do not have to correlate with a specific thread in the group. As long as the threadgroup processes all vertices associated with the primitives in the threadgroup, resources can be allocated in whatever way is most efficient. Additionally, the Mesh Shader outputs both per-vertex and per-primitive attributes, which allows the user to be more precise and space efficient.  

What does an Amplification Shader do? 

While the Mesh Shader is a fairly flexible tool, it does not allow for all tessellation scenarios and is not always the most efficient way to implement per-instance culling. For this we have the Amplification Shader. What it does is simple: dispatch threadgroups of Mesh Shaders. Each Mesh Shader has access to the data from the parent Amplification Shader and does not return anything. The Amplification Shader is optional, and also has access to groupshared memory, making it a powerful tool to allow the Mesh Shader to replace any current pipeline scenario.  

What exactly is a Meshlet?  

A meshlet is a subset of a mesh created through an intentional partition of the geometry. Meshlets should be somewhere in the range of 32 to around 200 vertices, depending on the number of attributes, and will have as many shared vertices as possible to allow for vertex re-use during rendering. This partitioning will be pre-computed and stored with the geometry to avoid computation at runtime, unlike the current Input Assembler, which must attempt to dynamically identify vertex reuse every time a mesh is drawn. Titles can convert meshlets into regular index buffers for vertex shader fallback if a device does not support Mesh Shaders.
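As a rough illustration of what such a pre-computed partition could look like, the C++ sketch below splits an index buffer into meshlets with a simple greedy pass and expands them back into a regular index buffer for the fallback path. The `MeshletData`, `BuildMeshlets`, and `ToIndexBuffer` names are hypothetical, and the in-order greedy strategy is only the simplest possible choice; a production tool would typically also optimize for vertex sharing and spatial locality.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative CPU-side layout; not part of any D3D12 API.
struct MeshletData {
    std::vector<uint32_t> uniqueVertexIndices; // indices into the mesh's vertex buffer
    std::vector<uint32_t> localTriangles;      // 3 meshlet-local indices per triangle
};

// Walks the triangle list in order and starts a new meshlet whenever adding
// the next triangle would exceed either limit.
// Assumes indices.size() is a multiple of 3, maxVerts >= 3 and maxPrims >= 1.
inline std::vector<MeshletData> BuildMeshlets(const std::vector<uint32_t>& indices,
                                              std::size_t maxVerts, std::size_t maxPrims) {
    std::vector<MeshletData> meshlets;
    std::unordered_map<uint32_t, uint32_t> remap; // mesh index -> meshlet-local index
    MeshletData current;

    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        // Count the vertices this triangle would add to the current meshlet.
        std::size_t newVerts = 0;
        for (std::size_t k = 0; k < 3; ++k) {
            bool seenInTriangle = false;
            for (std::size_t j = 0; j < k; ++j)
                seenInTriangle |= (indices[i + j] == indices[i + k]);
            if (!seenInTriangle && remap.find(indices[i + k]) == remap.end())
                ++newVerts;
        }
        const bool full = current.uniqueVertexIndices.size() + newVerts > maxVerts ||
                          current.localTriangles.size() / 3 + 1 > maxPrims;
        if (full) {
            meshlets.push_back(current);
            current = MeshletData{};
            remap.clear();
        }
        // Append the triangle, reusing local indices for vertices already present.
        for (std::size_t k = 0; k < 3; ++k) {
            const uint32_t meshIndex = indices[i + k];
            auto it = remap.find(meshIndex);
            if (it == remap.end()) {
                const uint32_t local = static_cast<uint32_t>(current.uniqueVertexIndices.size());
                it = remap.emplace(meshIndex, local).first;
                current.uniqueVertexIndices.push_back(meshIndex);
            }
            current.localTriangles.push_back(it->second);
        }
    }
    if (!current.localTriangles.empty())
        meshlets.push_back(current);
    return meshlets;
}

// Expands meshlets back into a regular index buffer (vertex shader fallback).
inline std::vector<uint32_t> ToIndexBuffer(const std::vector<MeshletData>& meshlets) {
    std::vector<uint32_t> indices;
    for (const MeshletData& m : meshlets)
        for (uint32_t local : m.localTriangles)
            indices.push_back(m.uniqueVertexIndices[local]);
    return indices;
}
```

Because each meshlet stores its own list of unique vertices, vertex reuse is decided once at build time, and the round trip through `ToIndexBuffer` reproduces the original triangle list.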

Now that I’m sold, how do I use this feature? 

Building a Mesh Shader is fairly simple.  

You must specify the number of threads in your thread group using  

[ numthreads ( X, Y, Z ) ]

And the type of primitive being used with  

[ outputtopology ( T ) ]

The Mesh Shader can take a number of system values as inputs, including SV_DispatchThreadID, SV_GroupThreadID, SV_ViewID and more, but must output an array for vertices and one for primitives. These are the arrays that you will write to at the end of your computations. If the Mesh Shader is attached to an Amplification Shader, it must also have an input for the payload. The final requirement is that you must set the number of primitives and vertices that the Mesh Shader will export. You do this by calling

SetMeshOutputCounts ( uint numVertices, uint numPrimitives )

This function must be called exactly once in the Mesh Shader before the output arrays are written to. If this does not happen, the Mesh Shader will not output any data.  

Beyond these rules, there is so much flexibility in what you can do. Here is an example Mesh Shader, but more information and examples can be found in the spec.

#define MAX_MESHLET_SIZE 128
// Assumption: GROUP_SIZE is not defined in the listing as published. One thread
// per output element is assumed here, so it matches MAX_MESHLET_SIZE.
#define GROUP_SIZE MAX_MESHLET_SIZE
#define ROOT_SIG "CBV(b0), \
    CBV(b1), \
    CBV(b2), \
    SRV(t0), \
    SRV(t1), \
    SRV(t2), \
    SRV(t3)"

// The Constants, Instance, Vertex and VertexOut types are not shown in this
// listing and are assumed to be declared elsewhere.

struct Meshlet
{
    uint32_t VertCount;
    uint32_t VertOffset;
    uint32_t PrimCount;
    uint32_t PrimOffset;
    float3   AABBMin;    // DirectX::XMFLOAT3 on the CPU side
    float3   AABBMax;    // DirectX::XMFLOAT3 on the CPU side
    float4   NormalCone; // DirectX::XMFLOAT4 on the CPU side
};

struct MeshInfo
{
    uint32_t IndexBytes;
    uint32_t MeshletCount;
    uint32_t LastMeshletSize;
};

ConstantBuffer<Constants>   Constants : register(b0);
ConstantBuffer<Instance>    Instance : register(b1);
ConstantBuffer<MeshInfo>    MeshInfo : register(b2);
StructuredBuffer<Vertex>    Vertices : register(t0);
StructuredBuffer<Meshlet>   Meshlets : register(t1);
ByteAddressBuffer           UniqueVertexIndices : register(t2);
StructuredBuffer<uint>      PrimitiveIndices : register(t3);

// Unpacks one triangle: three 10-bit meshlet-local indices stored in a single uint.
uint3 GetPrimitive(Meshlet m, uint index)
{
    uint primitiveIndex = PrimitiveIndices[m.PrimOffset + index];
    return uint3(primitiveIndex & 0x3FF, (primitiveIndex >> 10) & 0x3FF, (primitiveIndex >> 20) & 0x3FF);
}

// Maps a meshlet-local vertex slot to an index into the Vertices buffer.
uint GetVertexIndex(Meshlet m, uint localIndex)
{
    localIndex = m.VertOffset + localIndex;
    if (MeshInfo.IndexBytes == 4) // 32-bit Vertex Indices
    {
        return UniqueVertexIndices.Load(localIndex * 4);
    }
    else // 16-bit Vertex Indices
    {
        // Byte address must be 4-byte aligned.
        uint wordOffset = (localIndex & 0x1);
        uint byteOffset = (localIndex / 2) * 4;
        // Grab the pair of 16-bit indices, shift & mask off proper 16-bits.
        uint indexPair = UniqueVertexIndices.Load(byteOffset);
        uint index = (indexPair >> (wordOffset * 16)) & 0xffff;
        return index;
    }
}

VertexOut GetVertexAttributes(uint meshletIndex, uint vertexIndex)
{
    Vertex v = Vertices[vertexIndex];

    float4 positionWS = mul(float4(v.Position, 1), Instance.World);
    VertexOut vout;
    vout.PositionVS   = mul(positionWS, Constants.View).xyz;
    vout.PositionHS   = mul(positionWS, Constants.ViewProj);
    vout.Normal       = mul(float4(v.Normal, 0), Instance.WorldInvTrans).xyz;
    vout.MeshletIndex = meshletIndex;
    return vout;
}

[RootSignature(ROOT_SIG)]
[NumThreads(GROUP_SIZE, 1, 1)]
[OutputTopology("triangle")]
void main(
    uint gtid : SV_GroupThreadID,
    uint gid : SV_GroupID,
    out indices uint3 tris[MAX_MESHLET_SIZE],
    out vertices VertexOut verts[MAX_MESHLET_SIZE]
)
{
    Meshlet m = Meshlets[gid];
    // Must be called before any write to the output arrays.
    SetMeshOutputCounts(m.VertCount, m.PrimCount);
    if (gtid < m.PrimCount)
    {
        tris[gtid] = GetPrimitive(m, gtid);
    }
    if (gtid < m.VertCount)
    {
        uint vertexIndex = GetVertexIndex(m, gtid);
        verts[gtid] = GetVertexAttributes(gid, vertexIndex);
    }
}
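The shader above expects its index data in two compact encodings: GetPrimitive reads three 10-bit indices from one 32-bit word, and the 16-bit branch of GetVertexIndex reads two indices per word. The C++ sketch below shows one way the CPU side could produce and check data in those layouts. The function names are hypothetical, and the code simply mirrors the shifts and masks in the listing rather than any official tooling.

```cpp
#include <cstdint>

struct Triangle { uint32_t i0, i1, i2; };

// Packs three meshlet-local vertex indices (each below 1024) into one 32-bit
// word, 10 bits per index, matching the masks and shifts in GetPrimitive.
inline uint32_t PackPrimitive(uint32_t i0, uint32_t i1, uint32_t i2) {
    return (i0 & 0x3FF) | ((i1 & 0x3FF) << 10) | ((i2 & 0x3FF) << 20);
}

// CPU mirror of GetPrimitive, useful for validating packed data.
inline Triangle UnpackPrimitive(uint32_t packed) {
    return { packed & 0x3FF, (packed >> 10) & 0x3FF, (packed >> 20) & 0x3FF };
}

// CPU mirror of the 16-bit branch of GetVertexIndex: two 16-bit indices share
// each 32-bit word, with the even-numbered index in the low half.
inline uint32_t Load16BitIndex(const uint32_t* words, uint32_t index) {
    const uint32_t wordOffset = index & 0x1;
    const uint32_t pair = words[index / 2];
    return (pair >> (wordOffset * 16)) & 0xFFFF;
}
```

Ten bits per index is more than enough here, since a meshlet in this example holds at most MAX_MESHLET_SIZE (128) vertices.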

How to build an Amplification Shader 

Amplification Shaders are similarly easy to start using. If you choose to use an Amplification Shader, you only have to specify the number of threads per group, using  

[ numthreads ( X, Y, Z ) ]

You may issue 0 or 1 calls to dispatch your Mesh Shaders using 

DispatchMesh ( ThreadGroupCountX, ThreadGroupCountY, ThreadGroupCountZ, MeshPayload )

Beyond this, you can choose to use groupshared memory, and the rest is up to your creativity on how to leverage this feature in the best way for your project. Here is a simple example to get you started:  

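struct payloadStruct
{
    uint myArbitraryData;
};
[numthreads(1, 1, 1)] // group size of 1 chosen for illustration
void AmplificationShaderExample(in uint3 groupID : SV_GroupID)
{
    payloadStruct p;
    p.myArbitraryData = groupID.z;
    DispatchMesh(1, 1, 1, p); // launch one Mesh Shader threadgroup with payload p
}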

Calling Shaders in the Runtime  

To use Mesh Shaders on the API side, make sure to call CheckFeatureSupport as follows to ensure that Mesh Shaders are available on your device:  

D3D12_FEATURE_DATA_D3D12_OPTIONS7 featureData = {};
pDevice->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS7, &featureData, sizeof(featureData));
if (featureData.MeshShaderTier >= D3D12_MESH_SHADER_TIER_1) {
  // Mesh Shaders are supported on this device
}

Additionally, the Pipeline State Object must be compliant with the restrictions of Mesh Shaders, meaning that no incompatible shaders can be attached (Vertex, Geometry, Hull, or Domain), IA and streamout must be disabled, and your pixel shader, if provided, must be DXIL. Shaders can be attached to a D3D12_PIPELINE_STATE_STREAM_DESC struct with the types CD3DX12_PIPELINE_STATE_STREAM_AS and CD3DX12_PIPELINE_STATE_STREAM_MS.

To call the shader, run the following on the command list (ID3D12GraphicsCommandList6):

DispatchMesh(ThreadGroupCountX, ThreadGroupCountY, ThreadGroupCountZ)

Which will launch either the Mesh Shader or the Amplification Shader if it is present. You can also use  

void ExecuteIndirect(  
    ID3D12CommandSignature *pCommandSignature,  
    UINT MaxCommandCount,  
    ID3D12Resource *pArgumentBuffer,  
    UINT64 ArgumentBufferOffset,  
    ID3D12Resource *pCountBuffer,  
    UINT64 CountBufferOffset );

To launch the shaders from the GPU instead of the CPU.  

Getting Started 

To use Mesh Shaders and Amplification Shaders in your application, install the latest Windows 10 Insider Preview build and SDK Preview Build for Windows 10 (20H1) from the Windows Insider Program. You’ll also need to download and use the latest DirectX Shader Compiler. Finally, because this feature relies on GPU hardware support, you’ll need to contact GPU vendors to find out specifics regarding supported hardware and drivers. 

You can find more information in the Mesh Shader specification, located here: 

