October 18th, 2007

New Intrinsic Support in Visual Studio 2008

  Hello. This is Dylan Birtolo, a UE writer for Visual C++. This is my first vcblog entry, but hopefully I will become a regular contributor. One of my most recent tasks was to incorporate the documentation for all of the new intrinsic functions that are being added to Visual C++ in Visual Studio 2008. It is very exciting, since support for over 100 intrinsics was added.

Before getting to the intrinsics themselves, it is worth explaining why you should prefer intrinsics even when you could use inline assembly (inline asm) to access the instructions directly. Here are some reasons to consider using the intrinsics:

  • Inline asm is not supported by Visual C++ when compiling for 64-bit targets. Therefore, if you want your code to be 64-bit compatible, you need to use intrinsics.
  • Ease of use. The intrinsics do not require you to be aware of registers or manage memory directly. Instead, you have a function that is complete with inputs and return values. This makes the instructions more accessible to a wider range of technical expertise.
  • The intrinsics are updated in the compiler. What this means from a user perspective is that if the compiler improves how it handles the intrinsics, you receive this benefit immediately. Otherwise, if you are using inline asm, you will be responsible for making any improvements.
  • The optimizer does not work well with inline asm code, so it is recommended that you write inline asm code in its own function, assemble it, and link it in. With the intrinsics, those additional steps are not necessary.
  • Code that uses intrinsics is also more portable than code that uses inline asm.
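To make the ease-of-use point concrete, here is a minimal sketch of one of the new 64-bit conversion intrinsics wrapped in an ordinary function. The helper name truncate_to_int64 is made up for illustration, and the sketch assumes an x64 target, where _mm_cvttsd_si64 is available.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics, including the 64-bit conversions
#include <cstdint>

// Hypothetical helper: truncates a double toward zero with CVTTSD2SI.
// No registers are named here; the compiler picks them and is free to
// optimize around the call, which it cannot do as well with inline asm.
// Assumes an x64 target.
inline std::int64_t truncate_to_int64(double value)
{
    __m128d packed = _mm_set_sd(value);  // put value in the low 64 bits
    return _mm_cvttsd_si64(packed);      // truncating conversion of the low double
}
```

Calling truncate_to_int64(2.9) yields 2 and truncate_to_int64(-2.9) yields -2, with nothing in the source that looks different from a normal function call.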

Now let’s get back to the new intrinsics. For the most part, these functions provide support for the Supplemental Streaming SIMD Extensions 3 (SSSE3), Streaming SIMD Extensions 4.1 (SSE4.1), SSE4.2, and SSE4A instructions. A handful of intrinsics were also created to support advanced bit manipulation instructions not available on earlier chipsets. All of these new intrinsics are first supported by the Penryn and Nehalem architectures for Intel and the Third-Generation AMD Opteron processors for AMD. However, regardless of your processor, you should always verify that a given intrinsic is supported before you attempt to use it. Not doing so could result in a run-time error.

To facilitate this verification process, the CPUID instruction has been updated. The latest copy of the documentation for Visual Studio 2008 contains a sample program in the topic __cpuid that you can copy, compile, and use. It is currently up to date and prints out in plain text what technologies your processor supports.
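The sample in the documentation is the one to use. As a much smaller sketch of the same idea, the code below reads CPUID function 1 and tests the ECX feature bits for SSSE3 (bit 9), SSE4.1 (bit 19), SSE4.2 (bit 20), and POPCNT (bit 23). The CpuFeatures struct and query_cpu_features function are hypothetical names, and the non-Microsoft branch is an assumption added only so the sketch also builds with GCC or Clang.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid, a GCC/Clang stand-in (assumption)
#endif

// Hypothetical helper type: one flag per feature checked below.
struct CpuFeatures {
    bool ssse3;
    bool sse41;
    bool sse42;
    bool popcnt;
};

// Reads CPUID function 1 and decodes a few ECX feature bits.
inline CpuFeatures query_cpu_features()
{
    unsigned int ecx = 0;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    ecx = static_cast<unsigned int>(regs[2]);
#else
    unsigned int eax = 0, ebx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        ecx = 0;                         // leaf unavailable: report nothing
    }
#endif
    CpuFeatures features;
    features.ssse3  = ((ecx >> 9)  & 1u) != 0;
    features.sse41  = ((ecx >> 19) & 1u) != 0;
    features.sse42  = ((ecx >> 20) & 1u) != 0;
    features.popcnt = ((ecx >> 23) & 1u) != 0;
    return features;
}
```

A program would call query_cpu_features() once at startup and choose the SSE4.1 code path only if the sse41 flag is set, falling back to a plain C++ path otherwise.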

All of the intrinsics are straightforward and have documentation as well as code samples. Take a look. Tables for the new intrinsics can be found in the following three topics: SSE4A and Advanced Bit Manipulation Intrinsics, Streaming SIMD Extensions 4 Instructions, and Supplemental Streaming SIMD Extensions 3 Instructions.

Here is a list of the new intrinsics, organized by the instruction they support. Several of the instructions are very similar and only differ based on the size of the input parameters. To save space, these instructions are listed together. The one unusual case that bears some special consideration is POPCNT. This is listed both under SSE4.2 and ABM. This is so that the intrinsics are compatible with both the AMD and Intel compilers.

  • SSE
    • CVTSI2SS – Converts a 64-bit signed integer to a floating point value and inserts it into a 128-bit parameter. Intrinsics: _mm_cvtsi64_ss
    • CVTSS2SI – Extracts a 32-bit floating point value and rounds it to a 64-bit integer. Intrinsics: _mm_cvtss_si64
    • CVTTSS2SI – Extracts a 32-bit floating point value and truncates it to a 64-bit integer. Intrinsics: _mm_cvttss_si64
  • SSE2
    • CVTSD2SI – Extracts the lowest 64-bit floating point value and rounds it to an integer. Intrinsics: _mm_cvtsd_si64
    • CVTSI2SD – Converts a 64-bit signed integer to a 64-bit floating point value and inserts it into a 128-bit parameter. Intrinsics: _mm_cvtsi64_sd
    • CVTTSD2SI – Extracts a 64-bit floating point value and truncates it to a 64-bit integer. Intrinsics: _mm_cvttsd_si64
    • MOVNTI – Writes 64 bits to a specified memory location. Intrinsics: _mm_stream_si64
    • MOVQ – Moves a 64-bit integer either to or from a 128-bit parameter. Intrinsics: _mm_cvtsi64_si128, _mm_cvtsi128_si64
  • SSSE3
    • PABSB / PABSW / PABSD – Gets the absolute value of signed integers. Intrinsics: _mm_abs_epi8, _mm_abs_epi16, _mm_abs_epi32, _mm_abs_pi8, _mm_abs_pi16, _mm_abs_pi32
    • PALIGNR – Combines two parameters and right-shifts the result. Intrinsics: _mm_alignr_epi8, _mm_alignr_pi8
    • PHADDSW – Adds two parameters that contain 16-bit signed integers, saturating the result at the maximum value for 16 bits. Intrinsics: _mm_hadds_epi16, _mm_hadds_pi16
    • PHADDW / PHADDD – Adds two parameters that contain signed integers. Intrinsics: _mm_hadd_epi16, _mm_hadd_epi32, _mm_hadd_pi16, _mm_hadd_pi32
    • PHSUBSW – Subtracts two parameters that contain 16-bit signed integers, saturating the result at the maximum value for 16 bits. Intrinsics: _mm_hsubs_epi16, _mm_hsubs_pi16
    • PHSUBW / PHSUBD – Subtracts two parameters that contain signed integers. Intrinsics: _mm_hsub_epi16, _mm_hsub_epi32, _mm_hsub_pi16, _mm_hsub_pi32
    • PMADDUBSW – Multiplies and adds together 8-bit integers. Intrinsics: _mm_maddubs_epi16, _mm_maddubs_pi16
    • PMULHRSW – Multiplies 16-bit signed integers and right shifts the results. Intrinsics: _mm_mulhrs_epi16, _mm_mulhrs_pi16
    • PSHUFB – Selects and shuffles 8-bit chunks from a 128-bit parameter. Intrinsics: _mm_shuffle_epi8, _mm_shuffle_pi8
    • PSIGNB / PSIGNW / PSIGND – Negates, zeroes, or preserves signed integers. Intrinsics: _mm_sign_epi8, _mm_sign_epi16, _mm_sign_epi32, _mm_sign_pi8, _mm_sign_pi16, _mm_sign_pi32
  • SSE4A
    • EXTRQ – Extracts specified bits from the parameter. Intrinsics: _mm_extract_si64, _mm_extracti_si64
    • INSERTQ – Inserts specified bits into a given parameter. Intrinsics: _mm_insert_si64, _mm_inserti_si64
    • MOVNTSD / MOVNTSS – Writes bits directly to a specified memory location without polluting the caches. Intrinsics: _mm_stream_sd, _mm_stream_ss
  • SSE4.1
    • DPPD / DPPS – Calculates the dot product of two parameters. Intrinsics: _mm_dp_pd, _mm_dp_ps
    • EXTRACTPS – Extracts a specified 32-bit floating point value from the parameter. Intrinsics: _mm_extract_ps
    • INSERTPS – Inserts a 32-bit floating point value into a 128-bit parameter and potentially zeroes out some bits. Intrinsics: _mm_insert_ps
    • MOVNTDQA – Loads 128 bits of data from a specified memory location. Intrinsics: _mm_stream_load_si128
    • MPSADBW – Calculates eight offset sums of absolute difference. Intrinsics: _mm_mpsadbw_epu8
    • PACKUSDW – Converts 32-bit signed integers to unsigned 16-bit integers using unsigned 16-bit saturation. Intrinsics: _mm_packus_epi32
    • PBLENDW / BLENDPD / BLENDPS / PBLENDVB / BLENDVPD / BLENDVPS – Blends two parameters together in various chunk sizes. Intrinsics: _mm_blend_epi16, _mm_blend_pd, _mm_blend_ps, _mm_blendv_epi8, _mm_blendv_pd, _mm_blendv_ps
    • PCMPEQQ – Compares 64-bit integers for equality. Intrinsics: _mm_cmpeq_epi64
    • PEXTRB / PEXTRW / PEXTRD / PEXTRQ – Extracts an integer from the input parameter. Intrinsics: _mm_extract_epi8, _mm_extract_epi16, _mm_extract_epi32, _mm_extract_epi64
    • PHMINPOSUW – Selects the minimum 16-bit unsigned integer and determines its index. Intrinsics: _mm_minpos_epu16
    • PINSRB / PINSRD / PINSRQ – Inserts an integer into a 128-bit parameter. Intrinsics: _mm_insert_epi8, _mm_insert_epi32, _mm_insert_epi64
    • PMAXSB / PMAXSD – Takes signed integers from two parameters and selects the maximum. Intrinsics: _mm_max_epi8, _mm_max_epi32
    • PMAXUW / PMAXUD – Takes unsigned integers from two parameters and selects the maximum. Intrinsics: _mm_max_epu16, _mm_max_epu32
    • PMINSB / PMINSD – Takes signed integers from two parameters and selects the minimum. Intrinsics: _mm_min_epi8, _mm_min_epi32
    • PMINUW / PMINUD – Takes unsigned integers from two parameters and selects the minimum. Intrinsics: _mm_min_epu16, _mm_min_epu32
    • PMOVSXBW / PMOVSXBD / PMOVSXBQ / PMOVSXWD / PMOVSXWQ / PMOVSXDQ – Converts signed integers of one size to a larger size. Intrinsics: _mm_cvtepi8_epi16, _mm_cvtepi8_epi32, _mm_cvtepi8_epi64, _mm_cvtepi16_epi32, _mm_cvtepi16_epi64, _mm_cvtepi32_epi64
    • PMOVZXBW / PMOVZXBD / PMOVZXBQ / PMOVZXWD / PMOVZXWQ / PMOVZXDQ – Converts unsigned integers of one size to a larger size. Intrinsics: _mm_cvtepu8_epi16, _mm_cvtepu8_epi32, _mm_cvtepu8_epi64, _mm_cvtepu16_epi32, _mm_cvtepu16_epi64, _mm_cvtepu32_epi64
    • PMULDQ – Multiplies 32-bit signed integers and stores the result as 64-bit signed integers. Intrinsics: _mm_mul_epi32
    • PMULLD – Multiplies 32-bit signed integers and stores the low 32 bits of each result. Intrinsics: _mm_mullo_epi32
    • PTEST – Calculates a bitwise test of two 128-bit parameters and returns a value based on the CF and ZF bits of the CC flags register. Intrinsics: _mm_testc_si128, _mm_testnzc_si128, _mm_testz_si128
    • ROUNDPD / ROUNDPS – Rounds floating point values. Intrinsics: _mm_ceil_pd, _mm_ceil_ps, _mm_floor_pd, _mm_floor_ps, _mm_round_pd, _mm_round_ps
    • ROUNDSD / ROUNDSS – Combines two parameters, rounding a floating point value from one of them. Intrinsics: _mm_ceil_sd, _mm_ceil_ss, _mm_floor_sd, _mm_floor_ss, _mm_round_sd, _mm_round_ss
  • SSE4.2
    • CRC32 – Calculates the CRC-32C checksum of a parameter. Intrinsics: _mm_crc32_u8, _mm_crc32_u16, _mm_crc32_u32, _mm_crc32_u64
    • PCMPESTRI / PCMPESTRM – Compares two parameters of specified length. Intrinsics: _mm_cmpestra, _mm_cmpestrc, _mm_cmpestri, _mm_cmpestrm, _mm_cmpestro, _mm_cmpestrs, _mm_cmpestrz
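    • PCMPGTQ – Compares 64-bit signed integers to determine which are greater. Intrinsics: _mm_cmpgt_epi64
    • PCMPISTRI / PCMPISTRM – Compares two string parameters of implicit length, where a null character marks the end of each string. Intrinsics: _mm_cmpistra, _mm_cmpistrc, _mm_cmpistri, _mm_cmpistrm, _mm_cmpistro, _mm_cmpistrs, _mm_cmpistrz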
    • POPCNT – Counts the number of bits set to 1. Intrinsics: _mm_popcnt_u32, _mm_popcnt_u64, __popcnt16, __popcnt, __popcnt64
  • Advanced Bit Manipulation
    • LZCNT – Counts the number of zeroes at the start of a parameter. Intrinsics: __lzcnt16, __lzcnt, __lzcnt64
    • POPCNT – Counts the number of bits set to 1. Intrinsics: _mm_popcnt_u32, _mm_popcnt_u64, __popcnt16, __popcnt, __popcnt64
  • Other new intrinsics
    • _InterlockedCompareExchange128 – Performs an atomic compare and exchange on a 128-bit value.
    • _mm_castpd_ps / _mm_castpd_si128 / _mm_castps_pd / _mm_castps_si128 / _mm_castsi128_pd / _mm_castsi128_ps – Reinterprets a 128-bit parameter as 32-bit floating point values (ps), 64-bit floating point values (pd), or integer data (si128) without changing any bits.
    • _mm_cvtsd_f64 – Extracts the lowest 64-bit floating point value from the parameter.
    • _mm_cvtss_f32 – Extracts a 32-bit floating point value.
    • __rdtscp – Generates RDTSCP. Writes TSC_AUX[31:0] to memory and returns the 64-bit Time Stamp Counter result.
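
As a rough illustration of a few entries from the list, the sketch below uses the MOVQ intrinsics, one of the cast intrinsics, and the two scalar extraction intrinsics. The wrapper function names are invented for this example, and the code assumes an x64 target, since the 64-bit MOVQ intrinsics are only available there.

```cpp
#include <emmintrin.h>  // SSE2 header; also pulls in the SSE declarations
#include <cstdint>

// MOVQ: move a 64-bit integer into a 128-bit value and back out again.
inline std::int64_t movq_round_trip(std::int64_t value)
{
    __m128i packed = _mm_cvtsi64_si128(value);  // integer -> low 64 bits
    return _mm_cvtsi128_si64(packed);           // low 64 bits -> integer
}

// _mm_castps_si128 reinterprets the bits; it does not convert the value.
// The result is the raw IEEE 754 bit pattern of the float.
inline std::uint32_t float_bits(float value)
{
    __m128i as_int = _mm_castps_si128(_mm_set_ss(value));
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(as_int));
}

// _mm_cvtsd_f64 returns the lowest double of a 128-bit value.
inline double lowest_double(double low, double high)
{
    return _mm_cvtsd_f64(_mm_set_pd(high, low));  // _mm_set_pd takes high first
}

// _mm_cvtss_f32 returns the lowest float of a 128-bit value.
inline float lowest_float(float low)
{
    return _mm_cvtss_f32(_mm_set_ps(3.0f, 2.0f, 1.0f, low));
}
```

For example, float_bits(1.0f) returns 0x3F800000, which is the bit pattern of 1.0f rather than the integer 1, and that difference is the reason the cast intrinsics are described as reinterpreting rather than converting.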