# Do You Prefer Fast or Precise?

### What is this Blog About?

My name is Jim Hogg, a Program Manager in the Compilers team.

We would like your feedback on a feature of the Visual C++ compiler that affects the code we generate for floating-point operations. Your answers will help determine what we do. **You can vote via survey — it should not take you more than a few minutes to fill out!**

### OK, I’m Still Reading . . .

The C and C++ languages let you declare variables of type float or double. We call these “floating-point” types. And the Visual C++ compiler lets you specify how it should treat calculations involving these floating-point variables. The options we discuss in this blog are /fp:fast and /fp:precise.

Today’s default is /fp:precise. This blog asks for your feedback on whether we should change the default to /fp:fast. This change would make your code run faster; but might reduce the accuracy of the results, depending upon the calculations involved.

There are many excellent articles that explain floating-point in detail. This blog, by contrast, includes an appendix providing a homely overview – enough for you to form an opinion on the issue of changing the default to /fp:fast. Readers who want to dig deeper may explore the links at the end of this post.

[Note that you have control either way: you can specify that the compiler should follow /fp:fast or /fp:precise down to the level of each .cpp file, or even each function]

Please let us know what you think, after reading this blog post, by filling out this short survey.

### Notation

This blog uses the notation 1.2E+34 as short-hand for 1.2 * 10^{34}. If the “fraction” part is 1.0, we abbreviate further: so 1.0E+23 is shortened to E+23.

### Floating Point Basics

In C++, a float can store a value in the 3 (approximate) disjoint ranges { [-E+38, -E-38], 0, [E-38, E+38] }. Each float consumes 32 bits of memory. In this limited space, a float can only store approximately 4 billion different values. It does this in a cunning way, where adjacent values for small numbers lie close together; while adjacent values for big numbers lie far apart. You can count on each float value being accurate to about 7 decimal digits.

### Floating Point Calculations

We all understand how a computer calculates with ints. But what about floats? One obvious effect is that if I add a big number and a small number, the small one may simply get lost. For example, E+20 + E-20 results in E+20 – there are not enough bits of precision within a float to represent the precise/exact/correct value.

Similarly, each calculation using floats has to round the precise result to fit within the space available (actually 23 bits). Depending upon the calculation, the result may differ a little, or a lot, from the mathematical result (the one you’d get if you had lots and lots of bits available).

Here is a simple example:

1 | int main() {<br /> float inc = 0.000001, sum = 0.0;<br /> for (int i = 1; i <= 1000000; ++i) sum += inc;<br /> printf("Sum = %f \n", sum);<br />} |

You would expect this program to add inc (one-millionth) to sum, one million times, resulting in an answer of 1.0. But one-millionth can only be represented approximately as a float (actually 0x358637bd), so the result obtained is not 1.0, but 1.009039.

To scare ourselves even more, note that calculations with floats do not obey all rules of algebra. For example, associativity of addition states that: (a + b) + c == a + (b + c). But floats don’t quite abide by that rule. For example:

- (E-10 + E10) + -E10 = E10 + -E10 = 0
- E-10 + (E10 + -E10) = E-10 + 0 = E-10

So results can differ, depending upon the order in which we perform the operations.

Floating-point calculations don’t obey all the laws of algebra – but in many cases, it’s “close enough” to the mathematically precise answer. [Eg: if we calculate the stress on a bridge truss to be 1.2593 tonnes, but the accurate value is 1.2592 tonnes, we’re likely happy: the bridge won’t fall down]

### What Does /fp:fast Do?

By throwing the /fp:fast switch, you tell the compiler it should pretend that floats (and doubles) obey the rules of simple algebra (associativity and distributivity). This allows the compiler to optimize your code so that it runs faster. It trades off accuracy for speed. (It also lets the compiler play a fast-and-loose with that sub-species of floats called NaNs – “Not a Number” – see below)

### How Fast is /fp:fast?

Just how much speedup will you get by enabling /fp:fast? Here are results we found using a few common benchmarks:

Name | Area | Speedup (x86) |

Parsec | Next-Gen Shared Memory | 1.58 |

Eigen | Linear Algebra | 1.03 |

Spec FP 2006 | CPU & Memory | 1.03 |

“Speedup” is defined as follows: denote the time to execute the benchmark, when compiled under /fp:precise, as Tprecise. Correspondingly, Tfast. Then “Speedup” is Tprecise/Tfast.

Note that the speedup you achieve will depend on the details of your App. For example, we measured a huge range of speedups among the individual Parsec benchmarks: from 1.0 (ie, no speedup) up to a massive 5.2x!

### How Inaccurate is /fp:fast?

As with speedup, accuracy of results will vary App to App. If your App, or test program, computes a simple result, then comparison is simple. But if your App computes hypersonic airflow round an airfoil, comparison is more challenging.

If your App is a game, then some calculations need only be accurate enough to plot the right color on the right pixels (so a display of 2048 columns needs an accuracy of 1 part in a few thousand). With game Apps, it’s unlikely you would even see any difference in the display between /fp:fast and /fp:precise. [Xbox games are compiled, by default, with /fp:fast]

### Counter Example

The explanations so far would lead you to expect that /fp:fast will sometimes (maybe always?) produce a result that is less accurate than /fp:precise. As a simple example, let’s consider the sum of the first million reciprocals, or Sum(1/n) for n = 1..1000000. I calculated the approximate result using floats, and the correct result using Boost’s cpp_dec_float (to a precision of 100 decimal digits). With /O2 level of optimization, the results are:

float /fp:precise | 14.3574 |

float /fp:fast | 14.3929 |

cpp_dec_float<100> | 14.39272672286 |

So the /fp:fast result is nearer the correct answer than the /fp:precise!

How can this be? With /fp:fast the auto-vectorizer emits the SIMD RCPPS machine instruction, which is both faster and more accurate than the DIVSS emitted for /fp:precise.

This is just one specific case. But the point is that even a complete error analysis won’t tell you whether /fp:fastis acceptable in your App – there’s more going on. The only way to be sure is to test your App under each regime and compare answers.

### What about Doubles?

This blog has described what happens with floats under /fp:fast. doubles are similar to floats, but occupy 64 bits, rather than 32; they have more bits dedicated to both significand and exponent. In some sense (which we won’t spell out), they obey the rules of algebra more closely than floats. But you can still observe the effects of rounding errors and their propagation through calculation. /fp:fast affects the behavior of both floats and doubles.

### Next Steps?

Please try an App, or test programs, with /fp:fast rather than the default of /fp:precise. Compare speed and accuracy. Based on this experience, please tell us whether you would agree for us to change the default for the Visual C++ compiler to /fp:fast. **Let us know what you think, by filling out this short survey.**

### Appendices

The next few sections, numbered A1, A2, etc provide a little more detail on floating-point. If this whets your appetite for more, please follow the links at the end of the post.

### A1. Integers

An intvariable in Visual C++ is 32 bits wide. It can store any whole number in the range -2,147483,648 to 2,147,483,647, inclusive. Adjacent values are spread evenly along the real number line, each lying 1 unit away from its neighbor.

### A2. Floating Point Format

Calculations in science or engineering need to represent fractional values, whose range is also wider than the 4 billion or so afforded by the ints. How can we possibly represent such an enormous range of numbers within the 32 bits that comprise a float? Answer: we divide our precious 32 bits into 3 chunks, like this:

- S, a 1-bit sign. 0 denotes positive. 1 denotes negative.
- V, a 23-bit “significand”. A binary fraction, where bits range in value from 2-1 to 2-23. (Actually, we normalize the original binary number so as to make its most significant bit a 1; which we therefore don’t need to store; so we really achieve 24 bits of precision)
- E, an 8-bit exponent. As an 8-bit, unsigned integer, this field can store values [0, 255]. But the values 0 and 255 are reserved (used to denote zeros, sub-normals, infinities and NaNs (see links for details). From the stored exponent value, we subtract 127 (the exponent “bias” – fixed for all floats) to get the actual exponent, in the range [-126, 127].

The value of a float is given by: (-1)S * (1 + V) * 2 (E – 127). Here is an example:

0 0111 1110 101 0000 0000 0000 0000 0000

- S = sign = 0, so this is a positive number
- E = exponent = 0111 1110, or 126 (decimal). Subtract 127 to get the actual exponent of -1.
- V = significand = 1 + (1 * 0.5) + (0 * 0.25) + (1 * 0.125) = 1.625

So the value of this particular float is 1.625 * 2-1 = 0.8125

We can readily see that the smallest float magnitude is therefore: 1 * 2^(-126) or about E-38. And the largest is: 2 * 2^127, or about E+38. (The interested reader can explore the topic of “sub-normal” values, which lie closer to zero, in links at the end of the blog)

### A3. How Do They Do That?

We appear to have achieved the impossible! Within 32 bits, floats can represent any number in the approximate range [-E38, +E38]. This is vastly wider than of a 32-bit int, which spans approximately [-2E9, +2E9]. What’s going on?

One way to span the wide range would be to use an int, but multiply its value by a big number, such as E29. That would let us span the range [-2E38, +2E38]. But the smallest number after zero we could represent would be many miles away, at E29! [We would call this a fixed-point, rather than floating-point, format]. Such a system is doomed to failure. We need something better.

In fact, floats vary the distance between neighbors: small values, such as E-20, lie very close together; large values, such as E+20, lie ‘miles’ apart. As you proceed thru the range, you need to take bigger and bigger jumps to reach the next float value. So floats allow us to represent a finite number of values in the approximate range [-E38, +E38] – but not all such possible values. Here are 3 examples of adjacent floats (they differ by the least significant bit in their significand):

- 0 0011 1111 000 0000 0000 0000 0000 0000 ~= 5.42101E-20
- 0 0011 1111 000 0000 0000 0000 0000 0001 ~= 5.4210115E-20

(The ~= means approximately equal). So these two very small, neighboring values, lie about 0.000015E-20 (1.5E-25) apart. (ie, a handful of yocto-metres)

- 0 0111 1111 000 0000 0000 0000 0000 0000 = 1.0
- 0 0111 1111 000 0000 0000 0000 0000 0001 ~= 1.000 000 1

So these two, middle-of-the-road, neighboring values, lie about E-7 apart. (ie, 100 nano-metres)

- 0 1100 0010 000 0000 0000 0000 0000 0000 ~= 1.4757395E+20
- 0 1100 0010 000 0000 0000 0000 0000 0001 ~= 1.4757397E+20

So these two very large, neighboring values, lie about 2E14 apart! (ie, a light-week)

### A4. Rounding Errors – Analogy

Use a pocket calculator to work out: 1.23 * 2.45 * 3.67. I get the answer 11.059545.

Now repeat, but round each intermediate result to hold just 3 significant digits. So we get:

- 1.23 * 2.45 = 3.0135, rounded gives 3.01
- 3.01 * 3.67 = 11.0467, rounded give 11.05

This answer is slightly wrong. It is 0.009545 too small. And that’s because we forced intermediate results to fit within the 3 decimal digits of our hobbled calculator. A similar thing happens when the computer uses floats – the calculated answer drifts up or down from the mathematically correct answer, because intermediate results are made to fit within the float’s limited-size. [This is a simplification – see links for details]

### A5. Nasty Numbers

Given some float variable, x, the compiler would like to assume that any intermediate calculation involving the expression (x – x) can be replaced by 0. But that’s not true if x has any of the special values NaN, +infinity or –infinity. (See later link for explanation). If you specify /fp:fast, the compiler will optimize (x – x) to zero. If not, it will perform the calculation, and thereby run more slowly. If x happens to have the value NaN, then the correct result for (x – x) would have been, not 0, but NaN.

### A6. Constant Sub-Expression Elimination

This, and the following two sections, give examples of the effects of enabling /fp:fast. Suppose the compiler generates the following, simplified-C-code for a function in your program:

Note that t2 = (a * b) * c, whilst t4 = a * (b * c). With /fp:precise, the compiler cannot not assume that t2 == t4 and would generate code to calculate t2 and, separately, to calculate t4. With /fp:fast, the compiler is allowed to infer that t2 and t4 have the same value. So it will calculate t2, and simply re-use that value for t4 (rather than calculate it over again). Of course, in many cases, the calculated values will be identical, or very close. If you’re unlucky (pathological differences in the magnitudes of the participating operands) the calculated results could be different.

### A7. Auto-Vectorization

The /fp:fast switch lets the optimizer perform auto-vectorization of code patterns not otherwise allowed. (See the sequence of blog posts on auto-vectorization). For example, suppose our program calculates the sum of an array of 100 floats. This would take 100 iterations of a simple loop. But we can use the chip’s vector registers to get the answer in just 25 iterations, performing 4 calculations in parallel on each iteration. So, instead of:

- sum = a[0] + a[1] + a[2] + a[3] + a[4] + a[5] + . . . a[99]

we split the calculation into 4 partial sums, sum0 thru sum3, which we run in parallel; then add them together:

- sum0 = a[0] + a[4] + a[8] + . . . a[96]
- sum1 = a[1] + a[5] + a[9] + . . . a[97]
- sum2 = a[2] + a[6] + a[10] + . . . a[98]
- sum3 = a[3] + a[7] + a[11] + . . . a[99]
- sum’ = sum0 + sum1 + sum2 + sum3

Does sum’ == sum ? Only if (a[0]+a[4]+…) + (a[1]+a[5]+…) + (a[2]+a[6]+…) + ([a[3]+a[7]+…) == a[0] + a[1] + a[2] +… This holds under associativity, which floats don’t adhere to, all of the time. Specifying /fp:fast lets the compiler transform your code to run faster – up to 4 times faster, for this simple calculation.

### Links – More Details on Floating Point

- Floating point on Wikipedia. A peek into the complexities of floating-point.
- IEEE floating point on Wikipedia. If you enjoyed the previous topic, dig deeper with this one.
- Online Floating-Point Calculator – great for experimenting with the bits inside a float or double.
- What Every Computer Scientist Should Know About Floating-Point Arithmetic. Thorough treatment. [Hazard warning for the mathematically-averse: includes theorems and proofs]
- /fp Switches on MSDN. (Actually I notice this article is out-of-date, now that we default to SSE2 instructions and registers, instead of the ancient 80-bit, x87 FPU. Must remember to post a fix)
- Microsoft Visual C++ Floating-Point Optimization. Old, but still apt.
- The IEEE-754-2008 Technical Standard: 58 pages of wall-to-wall fun.

## 0 comments