Thursday, September 2, 2021

The Little Things: Comparing Floating Point Numbers

There is a lot of confusion about floating-point numbers and a lot of bad advice going around. IEEE-754 floating-point numbers are a complex beast, and comparing them is not always easy, but in this post, we will take a look at different approaches and their tradeoffs.

Note that this whole post assumes binary IEEE-754 floating-point numbers. There are other kinds of floating-point numbers, e.g. IBM likes decimal floating-point numbers enough to support them in hardware. However, most of the text below applies to those representations too.

Floating point basics

I do not want to get into too many details about the representation of floating-point numbers or their arithmetic, but we still need to go over some important points. They are required to build an understanding of the different comparison methods we will look at later.

Floating-point numbers are a (one) way of dealing with real numbers in fixed-size storage inside a computer. The binary representation consists of 3 parts, the sign bit, the mantissa, and the exponent.

The sign bit should be self-explanatory. It decides which sign the number resulting from the rest of the bits will have. The mantissa stores the digits of the represented number, while the exponent stores the magnitude of the number.
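
To make the layout concrete, here is a minimal sketch that pulls the three fields out of a float. It assumes float is the 32-bit IEEE-754 binary32 format and a C++20 compiler (for std::bit_cast); the function name is just for illustration.

    #include <bit>
    #include <cstdint>
    #include <cstdio>

    // Assumes binary32: 1 sign bit, 8 exponent bits (biased by 127), 23 mantissa bits.
    void print_fields(float f) {
        const auto bits = std::bit_cast<std::uint32_t>(f);
        const std::uint32_t sign     = bits >> 31;
        const std::uint32_t exponent = (bits >> 23) & 0xFFu;
        const std::uint32_t mantissa = bits & 0x7FFFFFu;
        std::printf("%g -> sign=%u exponent=%u mantissa=0x%06X\n",
                    f, sign, exponent, mantissa);
    }

    int main() {
        print_fields(1.0f);   // sign=0 exponent=127 mantissa=0x000000
        print_fields(-2.5f);  // sign=1 exponent=128 mantissa=0x200000
    }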

Because the total number of bits split between these three parts is fixed, we necessarily lose precision when representing some numbers: there are not enough bits in the mantissa. Because the allocation of bits to each part is also fixed, the absolute loss of precision increases as we represent larger numbers, while the relative loss of precision remains the same.
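
A quick illustration of the growing absolute error (a sketch, assuming the common binary32 float with a 24-bit significand):

    #include <cassert>

    int main() {
        // Near 1, adjacent floats are about 1.2e-7 apart, so adding 1 is represented exactly.
        float small = 1.0f;
        assert(small + 1.0f == 2.0f);

        // Near 1e8, adjacent floats are 8 apart, so adding 1 is rounded away entirely.
        float big = 100000000.0f;
        float sum = big + 1.0f;
        assert(sum == big);
    }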

Floating-point numbers also contain some special values used to represent specific "states" outside of normal operations. As an example, if a number is so big it overflows the floating-point type, it will be represented as infinity (or negative infinity if it overflows in the negative direction). The other important kind of special values are the NaN (Not a Number) values.

There are different kinds of NaN, but the important things about them are that they result from invalid floating-point operations, e.g. \(\frac{0}{0}\) or \(\frac{\infty}{\infty}\), and that they behave unintuitively, because \(\textrm{NaN} \neq \textrm{NaN}\).
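
A short sketch of both special values in action, using only standard <cmath> and <limits> facilities:

    #include <cassert>
    #include <cmath>
    #include <limits>

    int main() {
        // Overflowing the largest finite double yields infinity.
        double huge = std::numeric_limits<double>::max();
        assert(std::isinf(huge * 2.0));

        // Invalid operations such as 0.0 / 0.0 or inf / inf yield NaN,
        // and NaN compares unequal to everything, including itself.
        double nan = std::numeric_limits<double>::quiet_NaN();
        assert(std::isnan(nan));
        assert(nan != nan);
    }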

With this knowledge we can now look at how we can compare two floating-point numbers.

Comparing floating-point numbers

There are 4 (5) different ways to compare floating-point numbers. They are:

  • Bitwise comparison
  • Direct ("exact") IEEE-754 comparison
  • Absolute margin comparison
  • Relative epsilon comparison
  • ULP (Unit In Last Place) based comparison

Apart from bitwise comparison, all of them have their merits (and drawbacks). The bitwise comparison is included only to contrast it with the "exact" comparison; I am not aware of any real-world use for it.

Bitwise and direct comparison

The idea behind bitwise comparison is exceedingly simple. Two floating-point numbers are equal iff their bit representations are the same.

This is not what happens if you write lhs == rhs in your code.

If you write lhs == rhs in your code, you get what is often called "exact" comparison. However, this does not mean that the numbers are compared bitwise, because e.g. -0. == 0. and NaN != NaN, even though in the first case the two sides have different bit representations, and in the latter case both sides might have the exact same bit representation.
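
A small sketch contrasting the two, using std::memcmp for the bitwise view (the helper name is just for illustration):

    #include <cassert>
    #include <cstring>
    #include <limits>

    // Bitwise equality: compare the underlying bytes of the two doubles.
    bool bitwise_equal(double lhs, double rhs) {
        return std::memcmp(&lhs, &rhs, sizeof(double)) == 0;
    }

    int main() {
        // -0.0 and 0.0 compare equal directly, but their bits differ.
        assert(-0.0 == 0.0);
        assert(!bitwise_equal(-0.0, 0.0));

        // Two NaNs with the same bit pattern still compare unequal directly.
        double nan = std::numeric_limits<double>::quiet_NaN();
        assert(nan != nan);
        assert(bitwise_equal(nan, nan));
    }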

Direct comparison is useful only rarely, but it is not completely useless. Because the basic operations are specified exactly, any computation using only them should produce the same output for the same input. The situation is worse for the various transcendental functions, but reasonably fast correctly-rounded libraries are beginning to exist.

All in all, if you are writing code that does some computations with floating-point numbers and you require the results to be portable, you should have a bunch of tests relying purely on direct comparison.

Absolute margin comparison

Absolute margin comparison is the name for writing \(|\textrm{lhs} - \textrm{rhs}| \leq \textrm{margin}\). This means that two numbers are equal if their distance is at most some fixed margin.

The two big advantages of absolute margin comparison are that it is easy to reason about decimally ("I want to be within 0.5 of the correct result") and that it does not break down close to 0. The disadvantage is that it instead breaks down for large values of lhs or rhs, where it decays into direct comparison.
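
A minimal sketch of such a comparison (the helper name and the margin value are illustrative only):

    #include <cmath>

    // True if lhs and rhs are within `margin` of each other (absolute margin comparison).
    // Note that any comparison involving NaN is false, so NaN inputs are never "equal".
    bool margin_equal(double lhs, double rhs, double margin) {
        return std::fabs(lhs - rhs) <= margin;
    }

    // Usage: margin_equal(computed, 42.0, 0.5) accepts anything in [41.5, 42.5].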

Relative epsilon comparison

The relative epsilon comparison is the name for writing \(|\textrm{lhs} - \textrm{rhs}| \leq \varepsilon * \max(|\textrm{lhs}|, |\textrm{rhs}|)\). This means that two numbers are equal if they are within some factor of each other.

Unlike margin comparison, epsilon comparison does not break down for large lhs and rhs values. The tradeoff is that it instead breaks down (by decaying to exact comparison) around 0. Just like margin comparison, it is quite easy to reason about decimally ("I want to be within 5% of the correct result").

You can also swap the maximum for a minimum of the two numbers, which gives you a stricter comparison but with the same advantages and disadvantages.
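
A sketch of both variants (helper names are illustrative):

    #include <algorithm>
    #include <cmath>

    // Lenient variant: scales the allowed difference by the larger magnitude.
    bool epsilon_equal_max(double lhs, double rhs, double epsilon) {
        return std::fabs(lhs - rhs) <= epsilon * std::max(std::fabs(lhs), std::fabs(rhs));
    }

    // Stricter variant: scales by the smaller magnitude instead.
    bool epsilon_equal_min(double lhs, double rhs, double epsilon) {
        return std::fabs(lhs - rhs) <= epsilon * std::min(std::fabs(lhs), std::fabs(rhs));
    }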

ULP-based comparison

The last option is to compare two numbers based on their ULP distance. The ULP distance of two numbers is the number of representable floating-point numbers between them, plus 1. This means that if two numbers do not have any other representable numbers between them, their ULP distance is 1. If there is one number between them, the distance is 2, etc.
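
A sketch of one common way to compute this distance for doubles, assuming IEEE-754 and a C++20 compiler for std::bit_cast. It relies on the fact that for finite doubles of the same sign, adjacent values have adjacent bit patterns when read as integers; handling mixed signs, infinities, and NaN is left out to keep it short.

    #include <bit>
    #include <cstdint>

    // ULP distance for two finite doubles of the same sign (simplified sketch).
    std::uint64_t ulp_distance(double lhs, double rhs) {
        const auto a = std::bit_cast<std::uint64_t>(lhs);
        const auto b = std::bit_cast<std::uint64_t>(rhs);
        return a > b ? a - b : b - a;
    }

    // Example: ulp_distance(1.0, std::nextafter(1.0, 2.0)) == 1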

The big advantage of using ULP comparisons is that it automatically scales across different magnitudes of compared numbers. It doesn't break down around 0, nor does it break down for large numbers. ULP-based comparison is also very easy to reason about numerically. You know what operations happened to the input and thus how far the output can be from the canonical answer and still be considered correct.

The significant disadvantage is that it is very hard, if not impossible, to reason about decimally without being an expert in numerical computations. Imagine explaining to a non-technical customer that you guarantee to be within 5 ULP of the correct answer.


So, what does all this mean? What comparison should you use in your code?

Sadly there is no one-size-fits-all answer. When comparing two floating-point numbers, you need to understand your domain and how the numbers came to be and then decide based on that.

What about Catch2?

I maintain a popular testing framework, Catch2, so you might be wondering how Catch2 handles comparing floating-point numbers. Catch2 provides some useful tools for comparing floating-point numbers, namely Approx and 3 different floating-point matchers, but it doesn't make any decisions for you.

Approx is a type that provides the standard relational operators, so it can be used directly in assertions, and it provides both margin and epsilon comparisons. An Approx instance compares equal to a number if that number is within either the margin or the epsilon (or both) of the target value.

There are two crucial things to remember about Approx. The first is that the epsilon comparison scales only with the Approx'd value, not the min/max of both sides of the comparison. The other is that a default-constructed Approx instance only performs epsilon comparison (margin defaults to 0).
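
For illustration, a typical use might look like this. This is a sketch assuming the Catch2 v3 header layout; the numeric values are arbitrary.

    #include <catch2/catch_approx.hpp>
    #include <catch2/catch_test_macros.hpp>

    using Catch::Approx;

    TEST_CASE("Approx combines margin and epsilon comparison") {
        // Epsilon scales with the Approx'd value (0.072), margin is an absolute band.
        REQUIRE(0.0721 == Approx(0.072).epsilon(0.01));   // within 1% of 0.072
        REQUIRE(0.0721 == Approx(0.072).margin(0.001));   // within +/- 0.001 of 0.072

        // By default only the epsilon comparison applies; margin defaults to 0.
        REQUIRE(1.0 + 1e-14 == Approx(1.0));
    }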

The matchers each implement one of the three approximate comparisons, and since they are matchers, you can arbitrarily combine them to compare two numbers with the desired semantics. However, it is important to remember that the ULP matcher does have a slightly non-standard interpretation of ULP distance.
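
As a sketch of combining them (again assuming the Catch2 v3 header layout; the target values and tolerances are arbitrary):

    #include <catch2/catch_test_macros.hpp>
    #include <catch2/matchers/catch_matchers_floating_point.hpp>

    using Catch::Matchers::WithinAbs;
    using Catch::Matchers::WithinRel;
    using Catch::Matchers::WithinULP;

    TEST_CASE("Combining floating-point matchers") {
        double result = 0.1 + 0.2;

        // Accept the result if it is within 0.1% of 0.3 OR within 1e-12 of it,
        // covering both the large-value and the near-zero failure modes.
        REQUIRE_THAT(result, WithinRel(0.3, 0.001) || WithinAbs(0.3, 1e-12));

        // ULP matcher: within 4 representable doubles of 0.3.
        REQUIRE_THAT(result, WithinULP(0.3, 4));
    }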

The ULP matcher's underlying assumption is that the distance between two numbers that directly compare equal should be 0, even though this is not the interpretation used by the standard library, e.g. by std::nextafter. This means that e.g. ulpDistance(-0, 0) == 0 as far as the ULP matcher is concerned, which leads to other minor differences from naive ULP distance calculations.

Summarizing the behaviour of the ULP matcher:
\[
\begin{aligned}
x = y &\implies \textrm{ulpDistance}(x, y) = 0 \\
\textrm{ulpDistance}(\textrm{max-finite}, \infty) &= 1 \\
\textrm{ulpDistance}(x, -x) &= 2 \times \textrm{ulpDistance}(x, 0) \\
\textrm{ulpDistance}(\textrm{NaN}, x) &= \infty
\end{aligned}
\]


That is all for this post. Now you can go and fix floating-point comparisons in your code. Or use this post to win internet arguments. As long as you don't give advice assuming that floating-point comparisons are one-size-fits-all, it is fine by me.



