In the works: HLSL Shader Model 6.6
Microsoft and its partners are happy to announce the development of Shader Model 6.6, the latest advancement in HLSL capability.
Shader Model 6.6 will grant shader developers increased flexibility to enhance and expand existing rendering approaches and devise all new ones. New features include expanded atomic operations, dynamic resource binding, derivatives and samples in compute shaders, packed 8-bit computations, and wave size.
64-bit Integer Atomic Operations
Shader Model 6.6 will introduce the ability to perform atomic arithmetic, bitwise, and exchange/store operations on 64-bit values.
All the following atomic intrinsic functions and methods will take 64-bit values when used on RWByteAddressBuffer and RWStructuredBuffer types in all shader stages:
void InterlockedAdd(inout BufType dest, int64_t value, out int64_t orig);
void InterlockedAnd(inout BufType dest, int64_t value, out int64_t orig);
void InterlockedOr(inout BufType dest, int64_t value, out int64_t orig);
void InterlockedXor(inout BufType dest, int64_t value, out int64_t orig);
void InterlockedMin(inout BufType dest, int64_t value, out int64_t orig);
void InterlockedMax(inout BufType dest, int64_t value, out int64_t orig);
void InterlockedExchange(inout BufType dest, int64_t value, out int64_t orig);
void InterlockedCompareStore(inout BufType dest, int64_t cmpval, int64_t value);
void InterlockedCompareExchange(inout BufType dest, int64_t cmpval, int64_t value, out int64_t orig);
Where RWByteAddressBuffer methods are concerned, each of these will have a *64 suffix to indicate the expected type.
Shader Model 6.6 will include optional support for other resource and variable types. Typed resources, including writeable typed buffers and textures, will be supported where the AtomicInt64OnTypedResourceSupported option is set. Groupshared (shared memory) variables will be supported where the AtomicInt64OnGroupSharedSupported option is set.
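As a rough illustration of the two forms described above, the sketch below uses the intrinsic form on a structured buffer and the *64 method form on a raw buffer. The buffer names, register slots, and thread group size are hypothetical, and the use of uint64_t operands rather than int64_t is an assumption; it is meant to be compiled against a Shader Model 6.6 compute target.

```hlsl
// Hypothetical resources: names and register slots are illustrative only.
RWStructuredBuffer<uint64_t> Counters : register(u0);
RWByteAddressBuffer          RawData  : register(u1);

[numthreads(64, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    // Build a 64-bit key with the thread ID in the upper 32 bits.
    uint64_t key = (uint64_t(dtid.x) << 32) | 0xFFFFu;

    // Intrinsic form on a structured buffer element. The 64-bit type of
    // the arguments selects the 64-bit operation; `previous` receives the
    // value that was stored before the operation.
    uint64_t previous;
    InterlockedMax(Counters[0], key, previous);

    // Method form on a raw buffer. The *64 suffix states the operand width.
    // The first argument is a byte offset; 8 keeps the access 64-bit aligned.
    uint64_t one = 1;
    uint64_t before;
    RawData.InterlockedAdd64(8, one, before);
}
```

Typed resources and groupshared variables are left out here because, as noted above, they depend on optional capability bits.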
Integer Atomics on Float-Typed Resources
Shader Model 6.6 will introduce support for using floating point values in the existing integer compare and exchange intrinsic functions. The functions that use compares use bitwise compares and not true floating point compares:
void InterlockedExchange(inout BufType dest, float value, out float orig);
void InterlockedCompareStoreFloatBitwise(inout BufType dest, float cmpval, float value);
void InterlockedCompareExchangeFloatBitwise(inout BufType dest, float cmpval, float value, out float orig);
InterlockedExchange is an existing intrinsic function that has been extended to accept floats. Because it involves no compare, no new suffix was needed. The RWByteAddressBuffer method version is given a *Float suffix to indicate the intended type.
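The following sketch shows how the float variants might be used, and why the bitwise compare matters. The buffer names, register slots, and the "empty slot is +0.0f" convention are assumptions made for the example.

```hlsl
// Hypothetical resources: names and register slots are illustrative only.
RWStructuredBuffer<float> Slots : register(u0);
RWByteAddressBuffer       Raw   : register(u1);

[numthreads(64, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    float candidate = float(dtid.x) * 0.5f;

    // Unconditional swap: no compare is involved, so the existing
    // intrinsic name is reused for float operands.
    float oldValue;
    InterlockedExchange(Slots[0], candidate, oldValue);

    // Store only if the slot still holds the bit pattern of +0.0f.
    // The compare is bitwise, so a slot containing -0.0f does NOT match,
    // even though -0.0f == 0.0f under a floating point compare.
    InterlockedCompareStoreFloatBitwise(Slots[1], 0.0f, candidate);

    // Raw buffer method form, with the *Float suffix and a byte offset.
    float oldRaw;
    Raw.InterlockedExchangeFloat(0, candidate, oldRaw);
}
```

The same bitwise rule means a NaN compare value can match a stored NaN with an identical bit pattern, which a floating point compare never would.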
Dynamic Resource Binding
Shader Model 6.6 will introduce the ability to create resources from descriptors by directly indexing into the CBV_SRV_UAV heap or the Sampler heap. This resource creation method eliminates the need for root signature descriptor table mapping but requires new global root signature flags to indicate the use of each heap.
The feature is exposed as two new built-in global indexable objects: ResourceDescriptorHeap and SamplerDescriptorHeap. Indexing these global objects returns an internal handle object. This object can be assigned to temporary resource or sampler objects without requiring resource binding locations or mapping through root signature descriptor tables.
<resource variable> = ResourceDescriptorHeap[uint index];
<sampler variable> = SamplerDescriptorHeap[uint index];
The assigned variable must match the heap type of the indexed array.
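A minimal pixel shader sketch of this pattern follows. The constant buffer layout, the index names, and the register slot are hypothetical; the application is assumed to supply valid heap indices and to use a root signature that opts in to direct heap indexing through the new flags mentioned above.

```hlsl
// Hypothetical constants: the application is assumed to fill these with
// indices into the CBV_SRV_UAV heap and the Sampler heap respectively.
cbuffer DrawConstants : register(b0)
{
    uint textureIndex;
    uint samplerIndex;
};

float4 main(float2 uv : TEXCOORD0) : SV_Target
{
    // The declared type of each local determines how the descriptor is
    // interpreted, so it has to agree with what the heap slot contains.
    Texture2D<float4> albedo  = ResourceDescriptorHeap[textureIndex];
    SamplerState      sampler0 = SamplerDescriptorHeap[samplerIndex];

    return albedo.Sample(sampler0, uv);
}
```

Note that no texture or sampler is declared at global scope with a register binding; only the indices travel through the root signature.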
Compute Shader Derivatives and Samples
Shader Model 6.6 will introduce derivative and sample intrinsic functions to compute shaders. Previous shader models restricted these functions to pixel shaders.
Derivative operations depend on 2×2 quads, which compute shaders don’t have. To map these functions to a compute shader, which views data as a serial sequence, we’ve defined the quads these functions operate on according to the compute shader lane index. One quad consists of the first four elements in the lane index sequence in left-to-right and then top-to-bottom order. The next quad similarly consists of the next four elements, and so on. This gives the 2×2 quads that the following intrinsic functions operate on.
The derivative functions added:
T ddx(in T value)
T ddx_coarse(in T value)
T ddy(in T value)
T ddy_coarse(in T value)
T ddx_fine(in T value)
T ddy_fine(in T value)
The sample functions added:
float TexObject::CalculateLevelOfDetail(SamplerState sampler_state, F pos)
float TexObject::CalculateLevelOfDetailUnclamped(SamplerState sampler_state, F pos)
R TexObject::Sample(SamplerState sampler_state, F location)
R TexObject::SampleBias(SamplerState sampler_state, F location, float Bias)
float TexObject::SampleCmp(SamplerComparisonState S, F location, float cmpval)
These operations will be optionally available for Amplification and Mesh shader stages where the DerivativesInMeshAndAmplificationShadersSupported capability bit is set.
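To make the quad mapping concrete, here is a sketch of a compute shader that arranges its threads so that each run of four consecutive lanes covers a 2×2 pixel block, then samples with implicit derivatives. It assumes that the lane index sequence follows SV_GroupIndex in a one-dimensional thread group, which the text above does not state explicitly; the resource names, register slots, and 8×8 tile layout are likewise hypothetical.

```hlsl
// Hypothetical resources: names and register slots are illustrative only.
Texture2D<float4>   Source   : register(t0);
SamplerState        Bilinear : register(s0);
RWTexture2D<float4> Dest     : register(u0);

// 64 threads = 16 quads, laid out here as a 4x4 grid of quads (8x8 pixels).
[numthreads(64, 1, 1)]
void main(uint3 groupId : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
    // Assumption: lanes 4k .. 4k+3 form quad k, ordered left-to-right,
    // then top-to-bottom within the quad.
    uint  quadIndex  = groupIndex / 4;
    uint  laneInQuad = groupIndex % 4;
    uint2 inQuad     = uint2(laneInQuad & 1, laneInQuad >> 1);
    uint2 quadOrigin = 2 * uint2(quadIndex % 4, quadIndex / 4);
    uint2 pixel      = groupId.xy * 8 + quadOrigin + inQuad;

    uint width, height;
    Dest.GetDimensions(width, height);
    float2 uv = (float2(pixel) + 0.5f) / float2(width, height);

    // Sample derives its level of detail from the differences in `uv`
    // across the four lanes of the quad, as it would in a pixel shader.
    // All lanes stay active up to this point so every quad is complete.
    Dest[pixel] = Source.Sample(Bilinear, uv);
}
```

With this layout, horizontally adjacent lanes in a quad differ by one pixel in x and vertically adjacent lanes differ by one pixel in y, which is the relationship the derivative functions rely on.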
Packed 8-Bit Operations
Shader Model 6.6 will add a new set of intrinsic functions for processing packed 8-bit data. These are useful to reduce bandwidth usage where lower precision calculations are acceptable.
These are the new data types representing a vector of packed 8-bit values:
uint8_t4_packed // 4 packed uint8_t values in a uint32_t
int8_t4_packed  // 4 packed int8_t values in a uint32_t
These new types can be cast to and from uint32_t values without a change in the bitwise representation.
The pack intrinsic functions allow packing a vector of 4 signed or unsigned values into a packed 32-bit value represented by the new packed data types. One version performs a datatype clamp and the other simply drops the unused bits.
uint8_t4_packed pack_u8(uint32_t4 unpackedVal);      // Pack lower 8 bits, drop unused bits
int8_t4_packed pack_s8(int32_t4 unpackedVal);        // Pack lower 8 bits, drop unused bits
uint8_t4_packed pack_u8(uint16_t4 unpackedVal);      // Pack lower 8 bits, drop unused bits
int8_t4_packed pack_s8(int16_t4 unpackedVal);        // Pack lower 8 bits, drop unused bits
uint8_t4_packed pack_clamp_u8(int32_t4 unpackedVal); // Pack and Clamp [0, 255]
int8_t4_packed pack_clamp_s8(int32_t4 unpackedVal);  // Pack and Clamp [-128, 127]
uint8_t4_packed pack_clamp_u8(int16_t4 unpackedVal); // Pack and Clamp [0, 255]
int8_t4_packed pack_clamp_s8(int16_t4 unpackedVal);  // Pack and Clamp [-128, 127]
The unpack intrinsic functions expand a packed 32-bit value holding four 8-bit values into a vector of 16-bit or 32-bit signed or unsigned values:
int16_t4 unpack_s8s16(int8_t4_packed packedVal);   // Sign Extended
uint16_t4 unpack_u8u16(uint8_t4_packed packedVal); // Non-Sign Extended
int32_t4 unpack_s8s32(int8_t4_packed packedVal);   // Sign Extended
uint32_t4 unpack_u8u32(uint8_t4_packed packedVal); // Non-Sign Extended
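A typical round trip might look like the sketch below: unpack four 8-bit channels, do the arithmetic at 32-bit width, then pack with clamping. The buffer name, register slot, and the particular adjustment applied are made up for the example, and the 32-bit variants are used so that 16-bit type support is not required.

```hlsl
// Hypothetical buffer of pixels, each stored as four 8-bit channels in a uint.
RWStructuredBuffer<uint> Pixels : register(u0);

[numthreads(64, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    // Casting from uint does not change the bit pattern.
    uint8_t4_packed packedIn = (uint8_t4_packed)Pixels[dtid.x];

    // Zero-extend each 8-bit channel to 32 bits and work in signed ints.
    int4 channels = int4(unpack_u8u32(packedIn));

    // An arbitrary adjustment that can leave the [0, 255] range:
    // a channel of 200 becomes 336, a channel of 10 becomes -44.
    channels = channels * 2 - 64;

    // pack_clamp_u8 saturates each channel to [0, 255], so 336 -> 255 and
    // -44 -> 0. pack_u8 would instead keep only the low 8 bits (336 -> 80).
    uint8_t4_packed packedOut = pack_clamp_u8(channels);

    Pixels[dtid.x] = (uint)packedOut;
}
```

Only one 32-bit load and one 32-bit store are issued per four channels, which is where the bandwidth saving comes from.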
Wave Size
Shader Model 6.6 will introduce a new compute shader attribute that allows the shader author to specify a wave size that the compute shader is compatible with.
This feature allows the application to guarantee that a shader will be run at the required wave size. With this attribute, DirectX 12 runtime validation will fail if shaders in a pipeline state object have a required wave size that is not in the range reported by the driver. Because use of this feature limits shader flexibility, we recommend it only for shaders that are compatible with a single wave size.
The required wave size is specified by an attribute before the entry function. The allowed wave sizes that an HLSL shader may specify are the powers of 2 between 4 and 128, inclusive. In other words, the set: [4, 8, 16, 32, 64, 128].
[WaveSize(<numLanes>)]
void main() ...
<numLanes> must be an immediate integer value of an allowed wave size.
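As an example of a shader that is only correct at one wave size, the sketch below reduces each thread group with a single wave operation. It assumes that a 32-thread group runs as exactly one 32-lane wave when the attribute is honored; the buffer names and register slots are hypothetical.

```hlsl
// Hypothetical resources: names and register slots are illustrative only.
RWStructuredBuffer<uint> Input     : register(u0);
RWStructuredBuffer<uint> GroupSums : register(u1);

// Require 32-lane waves. Pipeline creation is expected to fail on
// hardware whose supported wave size range does not include 32.
[WaveSize(32)]
[numthreads(32, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID, uint3 groupId : SV_GroupID)
{
    // With the group assumed to be exactly one wave, a wave-wide sum is
    // also a group-wide sum, so no groupshared memory or barrier is used.
    uint total = WaveActiveSum(Input[dtid.x]);

    if (WaveIsFirstLane())
    {
        GroupSums[groupId.x] = total;
    }
}
```

Without the attribute, the same code would silently produce partial sums on hardware that runs narrower waves, which is the kind of dependency the attribute is meant to make explicit.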
Shader Model 6.6 is a work in progress.
Initial compiler implementation of these features will be submitted to the DirectXShaderCompiler GitHub repository at https://github.com/microsoft/DirectXShaderCompiler. We will continue making additional improvements there over the coming months. Hardware vendors will be implementing driver backend support in parallel. Once complete, we will formally release Shader Model 6.6 and developers can take full advantage of it.