New Intrinsic Support in Visual Studio 2008

Hello. This is Dylan Birtolo, a UE writer for Visual C++. This is my first vcblog entry, but hopefully I will be more of a regular contributor. One of my most recent tasks was to incorporate the documentation for all of the new intrinsic functions that are being put into Visual Studio 2008 for VC++. It is very exciting, since support for over 100 intrinsics was added.

Before getting to the intrinsics themselves, it is important to mention why you should prefer intrinsics even when you could use inline assembly (inline asm) to access the instructions directly. Here are some reasons to consider using the intrinsics:

Inline asm is not supported by Visual C++ when compiling for 64-bit (x64) targets. Therefore, if you want your code to be 64-bit compatible, you need to use intrinsics.
Ease of use. The intrinsics do not require you to be aware of registers or manage memory directly. Instead, you have a function that is complete with inputs and return values. This makes the instructions more accessible to a wider range of technical expertise.
The intrinsics are updated in the compiler. What this means from a user perspective is that if the compiler improves how it handles the intrinsics, you receive this benefit immediately. Otherwise, if you are using inline asm, you will be responsible for making any improvements.
The optimizer does not work well with inline asm code, so it is recommended that you write inline asm code in its own function, assemble it, and link it in. With the intrinsics, those additional steps are not necessary.
Code that uses intrinsics is also more portable than code that uses inline asm.

Now let’s get back to the new intrinsics. For the most part, these functions provide access to the Supplemental Streaming SIMD Extensions 3 (SSSE3), Streaming SIMD Extensions 4.1 (SSE4.1), SSE4.2, and SSE4A instruction sets. A handful of intrinsics were also added to support advanced bit manipulation instructions not available on earlier chipsets. All of these new intrinsics are first supported by the Penryn and Nehalem architectures for Intel and the Third-Generation AMD Opteron processors for AMD. However, regardless of your processor, you should always verify that a given intrinsic is supported before you attempt to use it. Not doing so could result in a run-time error.

To facilitate this verification process, the CPUID instruction has been updated. The latest copy of the documentation for Visual Studio 2008 contains a sample program in the topic __cpuid that you can copy, compile, and use. It is currently up to date and prints out in plain text what technologies your processor supports.
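
As a rough illustration of that verification step (this is not the documentation sample itself), the sketch below reads the CPUID feature bits for the instruction sets discussed in this post. It assumes an x86 or x64 target. The names CpuFeatures, cpuid_leaf, bit, and query_cpu_features are hypothetical helpers made up for this example, and the non-Microsoft branch exists only so the sketch also builds with GCC or Clang.

```cpp
// Sketch: query CPUID feature bits before using the newer intrinsics.
// Assumption: an x86 or x64 target. CpuFeatures, cpuid_leaf, bit, and
// query_cpu_features are hypothetical names, not part of Visual C++.
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC and Clang)
#endif

struct CpuFeatures {
    bool sse2, ssse3, sse41, sse42, popcnt, lzcnt, sse4a;
};

// Fills regs with EAX, EBX, ECX, EDX for the requested leaf.
// Returns false if the processor does not implement that leaf.
static bool cpuid_leaf(unsigned leaf, unsigned regs[4]) {
#if defined(_MSC_VER)
    int info[4];
    // Leaf 0 (or 0x80000000) reports the highest leaf in its range in EAX.
    __cpuid(info, static_cast<int>(leaf & 0x80000000u));
    if (static_cast<unsigned>(info[0]) < leaf) return false;
    __cpuid(info, static_cast<int>(leaf));
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<unsigned>(info[i]);
    return true;
#else
    return __get_cpuid(leaf, &regs[0], &regs[1], &regs[2], &regs[3]) != 0;
#endif
}

// Tests bit n of value.
static bool bit(unsigned value, int n) { return ((value >> n) & 1u) != 0; }

CpuFeatures query_cpu_features() {
    CpuFeatures f = {};
    unsigned r[4];
    if (cpuid_leaf(1, r)) {            // standard feature flags
        f.sse2   = bit(r[3], 26);      // EDX bit 26
        f.ssse3  = bit(r[2], 9);       // ECX bit 9
        f.sse41  = bit(r[2], 19);      // ECX bit 19
        f.sse42  = bit(r[2], 20);      // ECX bit 20
        f.popcnt = bit(r[2], 23);      // ECX bit 23
    }
    if (cpuid_leaf(0x80000001u, r)) {  // extended feature flags
        f.lzcnt = bit(r[2], 5);        // ECX bit 5 (advanced bit manipulation)
        f.sse4a = bit(r[2], 6);        // ECX bit 6
    }
    return f;
}
```

A caller would typically run query_cpu_features() once at startup and branch to a plain C++ code path whenever the flag it needs is false.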

All of the intrinsics are straightforward and have documentation as well as code samples. Take a look. Tables for the new intrinsics can be found in the following three topics: SSE4A and Advanced Bit Manipulation Intrinsics, Streaming SIMD Extensions 4 Instructions, and Supplemental Streaming SIMD Extensions 3 Instructions.

Here is a list of the new intrinsics, organized by the instruction they support. Several of the instructions are very similar and only differ based on the size of the input parameters. To save space, these instructions are listed together. The one unusual case that bears some special consideration is POPCNT. This is listed both under SSE4.2 and ABM. This is so that the intrinsics are compatible with both the AMD and Intel compilers.

SSE

CVTSI2SS – Converts a 64-bit signed integer to a floating point value and inserts it into a 128-bit parameter. Intrinsics: _mm_cvtsi64_ss
CVTSS2SI – Extracts a 32-bit floating point value and rounds it to a 64-bit integer. Intrinsics: _mm_cvtss_si64
CVTTSS2SI – Extracts a 32-bit floating point value and truncates it to a 64-bit integer. Intrinsics: _mm_cvttss_si64

SSE2

CVTSD2SI – Extracts the lowest 64-bit floating point value and rounds it to a 64-bit integer. Intrinsics: _mm_cvtsd_si64
CVTSI2SD – Converts a 64-bit signed integer to a 64-bit floating point value and inserts it into a 128-bit parameter. Intrinsics: _mm_cvtsi64_sd
CVTTSD2SI – Extracts a 64-bit floating point value and truncates it to a 64-bit integer. Intrinsics: _mm_cvttsd_si64
MOVNTI – Writes 64 bits to a specified memory location. Intrinsics: _mm_stream_si64
MOVQ – Moves a 64-bit integer either to or from a 128-bit parameter. Intrinsics: _mm_cvtsi64_si128, _mm_cvtsi128_si64
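
To make the difference between the rounding and truncating conversions concrete, here is a minimal sketch built on the SSE2 intrinsics listed above. It assumes an x64 target, since these 64-bit forms are not available to 32-bit code. The wrapper names (round_to_int64, truncate_to_int64, movq_round_trip) are hypothetical and exist only for this example.

```cpp
// Sketch of the 64-bit SSE2 conversion intrinsics on an x64 target.
// The wrapper function names are hypothetical.
#include <emmintrin.h>  // SSE2 intrinsics

// CVTSD2SI: rounds using the current rounding mode
// (round-to-nearest-even unless the program has changed it).
long long round_to_int64(double x) {
    return _mm_cvtsd_si64(_mm_set_sd(x));
}

// CVTTSD2SI: always truncates toward zero.
long long truncate_to_int64(double x) {
    return _mm_cvttsd_si64(_mm_set_sd(x));
}

// MOVQ: moves a 64-bit integer into the low half of a 128-bit value
// and back out again.
long long movq_round_trip(long long v) {
    __m128i packed = _mm_cvtsi64_si128(v);  // upper 64 bits are zeroed
    return _mm_cvtsi128_si64(packed);
}
```

With the default rounding mode, round_to_int64(2.5) gives 2 and round_to_int64(-2.7) gives -3, while truncate_to_int64(-2.7) gives -2.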

SSSE3

PABSB / PABSW / PABSD – Gets the absolute value of signed integers. Intrinsics: _mm_abs_epi8, _mm_abs_epi16, _mm_abs_epi32, _mm_abs_pi8, _mm_abs_pi16, _mm_abs_pi32
PALIGNR – Combines two parameters and right-shifts the result. Intrinsics: _mm_alignr_epi8, _mm_alignr_pi8
PHADDSW – Horizontally adds adjacent pairs of 16-bit signed integers from two parameters, saturating each result to the signed 16-bit range. Intrinsics: _mm_hadds_epi16, _mm_hadds_pi16
PHADDW / PHADDD – Horizontally adds adjacent pairs of signed integers from two parameters. Intrinsics: _mm_hadd_epi16, _mm_hadd_epi32, _mm_hadd_pi16, _mm_hadd_pi32
PHSUBSW – Horizontally subtracts adjacent pairs of 16-bit signed integers from two parameters, saturating each result to the signed 16-bit range. Intrinsics: _mm_hsubs_epi16, _mm_hsubs_pi16
PHSUBW / PHSUBD – Horizontally subtracts adjacent pairs of signed integers from two parameters. Intrinsics: _mm_hsub_epi16, _mm_hsub_epi32, _mm_hsub_pi16, _mm_hsub_pi32
PMADDUBSW – Multiplies and adds together 8-bit integers. Intrinsics: _mm_maddubs_epi16, _mm_maddubs_pi16
PMULHRSW – Multiplies 16-bit signed integers and right shifts the results. Intrinsics: _mm_mulhrs_epi16, _mm_mulhrs_pi16
PSHUFB – Selects and shuffles 8-bit chunks from a 128-bit parameter. Intrinsics: _mm_shuffle_epi8, _mm_shuffle_pi8
PSIGNB / PSIGNW / PSIGND – Negates, zeroes, or preserves signed integers. Intrinsics: _mm_sign_epi8, _mm_sign_epi16, _mm_sign_epi32, _mm_sign_pi8, _mm_sign_pi16, _mm_sign_pi32
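
As a small illustration of how two of the SSSE3 intrinsics above combine, the sketch below sums the absolute values of four integers with PABSD and PHADDD. The function name sum_of_abs and the SSSE3_TARGET macro are hypothetical. The macro is only there so the sketch also builds with GCC or Clang without extra flags, and Visual C++ does not need it. The caller is assumed to have confirmed SSSE3 support first.

```cpp
// Sketch: PABSD and PHADDD through their intrinsics.
// sum_of_abs and SSSE3_TARGET are hypothetical names for this example.
// Assumption: the caller has already verified that the CPU supports SSSE3.
#include <tmmintrin.h>  // SSSE3 intrinsics

#if defined(__GNUC__)
#define SSSE3_TARGET __attribute__((target("ssse3")))
#else
#define SSSE3_TARGET
#endif

// Returns |a| + |b| + |c| + |d|.
SSSE3_TARGET int sum_of_abs(int a, int b, int c, int d) {
    __m128i v = _mm_setr_epi32(a, b, c, d);
    __m128i m = _mm_abs_epi32(v);      // {|a|, |b|, |c|, |d|}
    __m128i s = _mm_hadd_epi32(m, m);  // {|a|+|b|, |c|+|d|, |a|+|b|, |c|+|d|}
    s = _mm_hadd_epi32(s, s);          // every lane now holds the total
    return _mm_cvtsi128_si32(s);       // read the lowest lane
}
```

For example, sum_of_abs(-1, 2, -3, 4) returns 10.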

SSE4A

EXTRQ – Extracts specified bits from the parameter. Intrinsics: _mm_extract_si64, _mm_extracti_si64
INSERTQ – Inserts specified bits into a given parameter. Intrinsics: _mm_insert_si64, _mm_inserti_si64
MOVNTSD / MOVNTSS – Writes bits directly to a specified memory location without polluting the caches. Intrinsics: _mm_stream_sd, _mm_stream_ss

SSE4.1

DPPD / DPPS – Calculates the dot product of two parameters. Intrinsics: _mm_dp_pd, _mm_dp_ps
EXTRACTPS – Extracts a specified 32-bit floating point value from the parameter. Intrinsics: _mm_extract_ps
INSERTPS – Inserts a 32-bit floating point value into a 128-bit parameter and potentially zeroes out some bits. Intrinsics: _mm_insert_ps
MOVNTDQA – Loads 128 bits of data from a specified memory location. Intrinsics: _mm_stream_load_si128
MPSADBW – Calculates eight offset sums of absolute differences. Intrinsics: _mm_mpsadbw_epu8
PACKUSDW – Converts 32-bit signed integers to 16-bit unsigned integers using unsigned saturation. Intrinsics: _mm_packus_epi32
PBLENDW / BLENDPD / BLENDPS / PBLENDVB / BLENDVPD / BLENDVPS – Blends two parameters together using various chunk sizes. Intrinsics: _mm_blend_epi16, _mm_blend_pd, _mm_blend_ps, _mm_blendv_epi8, _mm_blendv_pd, _mm_blendv_ps
PCMPEQQ – Compares 64-bit integers for equality. Intrinsics: _mm_cmpeq_epi64
PEXTRB / PEXTRW / PEXTRD / PEXTRQ – Extracts an integer from the input parameter. Intrinsics: _mm_extract_epi8, _mm_extract_epi16, _mm_extract_epi32, _mm_extract_epi64
PHMINPOSUW – Selects the minimum 16-bit unsigned integer and determines its index. Intrinsics: _mm_minpos_epu16
PINSRB / PINSRD / PINSRQ – Inserts an integer into a 128-bit parameter. Intrinsics: _mm_insert_epi8, _mm_insert_epi32, _mm_insert_epi64
PMAXSB / PMAXSD – Takes signed integers from two parameters and selects the maximum. Intrinsics: _mm_max_epi8, _mm_max_epi32
PMAXUW / PMAXUD – Takes unsigned integers from two parameters and selects the maximum. Intrinsics: _mm_max_epu16, _mm_max_epu32
PMINSB / PMINSD – Takes signed integers from two parameters and selects the minimum. Intrinsics: _mm_min_epi8, _mm_min_epi32
PMINUW / PMINUD – Takes unsigned integers from two parameters and selects the minimum. Intrinsics: _mm_min_epu16, _mm_min_epu32
PMOVSXBW / PMOVSXBD / PMOVSXBQ / PMOVSXWD / PMOVSXWQ / PMOVSXDQ – Converts signed integers of one size to a larger size. Intrinsics: _mm_cvtepi8_epi16, _mm_cvtepi8_epi32, _mm_cvtepi8_epi64, _mm_cvtepi16_epi32, _mm_cvtepi16_epi64, _mm_cvtepi32_epi64
PMOVZXBW / PMOVZXBD / PMOVZXBQ / PMOVZXWD / PMOVZXWQ / PMOVZXDQ – Converts unsigned integers of one size to a larger size. Intrinsics: _mm_cvtepu8_epi16, _mm_cvtepu8_epi32, _mm_cvtepu8_epi64, _mm_cvtepu16_epi32, _mm_cvtepu16_epi64, _mm_cvtepu32_epi64
PMULDQ – Multiplies 32-bit signed integers and stores the result as 64-bit signed integers. Intrinsics: _mm_mul_epi32
PMULLD – Multiplies 32-bit signed integers and stores the low 32 bits of each result. Intrinsics: _mm_mullo_epi32
PTEST – Calculates a bitwise test of two 128-bit parameters and returns a value based on the CF and ZF bits of the CC flags register. Intrinsics: _mm_testc_si128, _mm_testnzc_si128, _mm_testz_si128
ROUNDPD / ROUNDPS – Rounds floating point values. Intrinsics: _mm_ceil_pd, _mm_ceil_ps, _mm_floor_pd, _mm_floor_ps, _mm_round_pd, _mm_round_ps
ROUNDSD / ROUNDSS – Combines two parameters, rounding a floating point value from one of them. Intrinsics: _mm_ceil_sd, _mm_ceil_ss, _mm_floor_sd, _mm_floor_ss, _mm_round_sd, _mm_round_ss
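
Here is a short sketch of two of the SSE4.1 intrinsics above, DPPS for a dot product and ROUNDSD for a floor operation. The names dot3, floor_sd, and SSE41_TARGET are hypothetical. The macro is only there so the sketch also builds with GCC or Clang without extra flags. The caller is assumed to have confirmed SSE4.1 support first.

```cpp
// Sketch: DPPS and ROUNDSD through their intrinsics.
// dot3, floor_sd, and SSE41_TARGET are hypothetical names for this example.
// Assumption: the caller has already verified that the CPU supports SSE4.1.
#include <smmintrin.h>  // SSE4.1 intrinsics

#if defined(__GNUC__)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

// Dot product of two 3-component vectors.
SSE41_TARGET float dot3(float ax, float ay, float az,
                        float bx, float by, float bz) {
    __m128 a = _mm_setr_ps(ax, ay, az, 0.0f);
    __m128 b = _mm_setr_ps(bx, by, bz, 0.0f);
    // Mask 0x71: the high nibble (0x7) multiplies lanes 0 to 2,
    // the low nibble (0x1) writes the sum to lane 0 only.
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0x71));
}

// Rounds a double down to the nearest whole number.
SSE41_TARGET double floor_sd(double x) {
    __m128d v = _mm_set_sd(x);
    // The low element comes from rounding the second argument;
    // the high element is copied from the first.
    return _mm_cvtsd_f64(_mm_floor_sd(v, v));
}
```

For example, dot3(1, 2, 3, 4, 5, 6) returns 32.0f and floor_sd(-2.5) returns -3.0.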

SSE4.2

CRC32 – Calculates the CRC-32C checksum of a parameter. Intrinsics: _mm_crc32_u8, _mm_crc32_u16, _mm_crc32_u32, _mm_crc32_u64
PCMPESTRI / PCMPESTRM – Compares two string parameters of explicitly specified length. Intrinsics: _mm_cmpestra, _mm_cmpestrc, _mm_cmpestri, _mm_cmpestrm, _mm_cmpestro, _mm_cmpestrs, _mm_cmpestrz
PCMPGTQ – Compares 64-bit signed integers for greater than. Intrinsics: _mm_cmpgt_epi64
PCMPISTRI / PCMPISTRM – Compares two string parameters of implicit (null-terminated) length. Intrinsics: _mm_cmpistra, _mm_cmpistrc, _mm_cmpistri, _mm_cmpistrm, _mm_cmpistro, _mm_cmpistrs, _mm_cmpistrz
POPCNT – Counts the number of bits set to 1. Intrinsics: _mm_popcnt_u32, _mm_popcnt_u64, __popcnt16, __popcnt, __popcnt64

Advanced Bit Manipulation

LZCNT – Counts the number of leading zero bits in a parameter. Intrinsics: __lzcnt16, __lzcnt, __lzcnt64
POPCNT – Counts the number of bits set to 1. Intrinsics: _mm_popcnt_u32, _mm_popcnt_u64, __popcnt16, __popcnt, __popcnt64
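
Because POPCNT is only safe to call after the processor has reported support for it, one way to put the earlier advice into practice is to pair the intrinsic with a plain C++ fallback. The sketch below does that. All of the helper names (cpu_has_popcnt, popcount_hw, popcount_sw, popcount32) and the POPCNT_TARGET macro are hypothetical, and the non-Microsoft branches exist only so the sketch also builds with GCC or Clang.

```cpp
// Sketch: use POPCNT when CPUID reports it, otherwise fall back to a loop.
// All helper names and POPCNT_TARGET are hypothetical.
#include <nmmintrin.h>  // _mm_popcnt_u32
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#define POPCNT_TARGET
#else
#include <cpuid.h>      // __get_cpuid (GCC and Clang)
#define POPCNT_TARGET __attribute__((target("popcnt")))
#endif

// True if CPUID leaf 1 reports POPCNT (ECX bit 23).
static bool cpu_has_popcnt() {
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);
    return (info[2] & (1 << 23)) != 0;
#else
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return (ecx & (1u << 23)) != 0;
#endif
}

// Hardware path: a single POPCNT instruction.
POPCNT_TARGET static unsigned popcount_hw(unsigned v) {
    return static_cast<unsigned>(_mm_popcnt_u32(v));
}

// Fallback path: clears the lowest set bit on each iteration.
static unsigned popcount_sw(unsigned v) {
    unsigned n = 0;
    for (; v != 0; v &= v - 1) ++n;
    return n;
}

// Picks the hardware path only when the processor supports it.
unsigned popcount32(unsigned v) {
    static const bool has_hw = cpu_has_popcnt();  // checked once
    return has_hw ? popcount_hw(v) : popcount_sw(v);
}
```

Either path gives the same answer, so popcount32(0xFFu) returns 8 whether or not the instruction is available.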

Other new intrinsics

_InterlockedCompareExchange128 – Performs an atomic 128-bit compare and exchange.
_mm_castpd_ps / _mm_castpd_si128 / _mm_castps_pd / _mm_castps_si128 / _mm_castsi128_pd / _mm_castsi128_ps – Reinterprets a 128-bit value as 32-bit floating point values (ps), 64-bit floating point values (pd), or integer data (si128) without changing any bits.
_mm_cvtsd_f64 – Extracts the lowest 64-bit floating point value from the parameter.
_mm_cvtss_f32 – Extracts a 32-bit floating point value.
_rdtscp – Generates RDTSCP. Writes TSC_AUX[31:0] to memory and returns the 64-bit Time Stamp Counter result.
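
The cast intrinsics are easy to confuse with conversions, so here is a minimal sketch of the difference, along with the two scalar extraction intrinsics. It assumes an x86 or x64 target with SSE2. The function names (float_bits, low_float, low_double) are hypothetical.

```cpp
// Sketch: cast intrinsics reinterpret bits; they do not convert values.
// float_bits, low_float, and low_double are hypothetical names.
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)

// Returns the raw IEEE-754 bit pattern of a float.
unsigned float_bits(float x) {
    __m128 v = _mm_set_ss(x);            // x in the lowest lane, zeros above
    __m128i bits = _mm_castps_si128(v);  // same 128 bits, viewed as integers
    return static_cast<unsigned>(_mm_cvtsi128_si32(bits));
}

// _mm_cvtss_f32: reads the lowest 32-bit float out of a 128-bit value.
float low_float(float a, float b, float c, float d) {
    return _mm_cvtss_f32(_mm_setr_ps(a, b, c, d));
}

// _mm_cvtsd_f64: reads the lowest 64-bit double out of a 128-bit value.
double low_double(double lo, double hi) {
    return _mm_cvtsd_f64(_mm_setr_pd(lo, hi));
}
```

For example, float_bits(1.0f) returns 0x3F800000, the bit pattern of 1.0f, rather than the integer 1.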
