What is IEEE 754?
Learn how your computer handles floating-point operations and how to take advantage of it
I have a simple but often overlooked question for you: how does a computer, or any modern microcontroller, represent floating-point numbers in hardware? That's right, a question you have probably never asked yourself, but one of extreme importance! Today, we will learn how IEEE 754, the standard for floating-point arithmetic, defines floating-point numbers, and what advantages led IEEE to choose this representation over the alternatives.
Firstly, let's see what the standard is all about. It defines:
- Arithmetic Formats - binary and decimal floating-point numbers, which can be finite numbers (including signed zero and subnormal numbers, which are numbers smaller than the smallest number in normalized form and help avoid underflows), infinities, and NaNs (Not a Number)
- Interchange Formats - encodings that can be used to exchange floating-point data efficiently and in a compact way
- Rounding Rules - conditions to be fulfilled when rounding numbers while performing conversions and arithmetic operations
- Operations - arithmetic and other operations on the arithmetic formats
- Exception Handling - edge cases, such as division by zero and underflow (see the sketch after this list)
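That last point is easy to observe from C. Here is a minimal sketch, assuming a C99 compiler and an IEEE 754-conforming platform (note that the FENV_ACCESS pragma is ignored by some compilers, such as GCC), that raises and tests the divide-by-zero exception flag through <fenv.h>:

```c
#include <stdio.h>
#include <fenv.h>

/* Tell the compiler we inspect the floating-point environment.
   Some compilers ignore this pragma; the example still works in practice. */
#pragma STDC FENV_ACCESS ON

int main(void) {
    volatile double zero = 0.0;   /* volatile keeps the compiler from folding the division */
    feclearexcept(FE_ALL_EXCEPT); /* start with a clean set of exception flags */

    volatile double inf = 1.0 / zero; /* raises the divide-by-zero exception flag */

    if (fetestexcept(FE_DIVBYZERO)) {
        printf("FE_DIVBYZERO was raised, result = %f\n", inf);
    }
    return 0;
}
```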
IEEE 754 is not the only way to represent floating-point numbers, but over time it has proven to be the most efficient. A number in this format is composed of 3 parts:
- Sign - the sign of the number, defining it as positive or negative
- Exponent - represents both positive and negative exponents. The stored value is the actual exponent plus a bias, so the bias must be subtracted from it to recover the actual exponent
- Mantissa - the fractional part of the number, which contains an implicit leading 1 for normalized numbers (the decoding sketch below shows how the three fields fit together)
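To make these three fields concrete, here is a minimal sketch, assuming float is the IEEE 754 binary32 format (true on virtually all modern hardware), that unpacks a float into its sign, exponent, and mantissa:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = -6.25f; /* example: -1.5625 * 2^2, so sign 1, stored exponent 129 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret the float's bytes as an integer */

    uint32_t sign     = bits >> 31;          /* 1 bit  */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits, biased by 127 */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits, implicit leading 1 not stored */

    printf("sign = %u, stored exponent = %u (actual %d), mantissa = 0x%06X\n",
           (unsigned)sign, (unsigned)exponent, (int)exponent - 127, (unsigned)mantissa);
    /* Reconstruct the value as: (-1)^sign * (1 + mantissa/2^23) * 2^(exponent - 127) */
    return 0;
}
```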
When it comes to floating-point numbers, we are used to the float and double types. The differences start with the size, 32 bits for a float and 64 bits for a double, and continue through the field widths:
| Type   | Sign  | Exponent | Mantissa | Bias |
|--------|-------|----------|----------|------|
| float  | 1 bit | 8 bits   | 23 bits  | 127  |
| double | 1 bit | 11 bits  | 52 bits  | 1023 |
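You can cross-check these widths against what your compiler reports using the standard <float.h> constants; on an IEEE 754 platform, FLT_MANT_DIG is 24 rather than 23 because it counts the implicit leading 1:

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* *_MANT_DIG counts the implicit leading 1, so it is stored mantissa bits + 1. */
    printf("float : %zu bytes, %d mantissa digits\n", sizeof(float),  FLT_MANT_DIG);
    printf("double: %zu bytes, %d mantissa digits\n", sizeof(double), DBL_MANT_DIG);
    /* On an IEEE 754 platform this prints 4/24 and 8/53, matching the table above. */
    return 0;
}
```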
Lastly, let's take a look at the special values. These are values that need some extra care because of their nature. They are:
- Zero - zero can't be normalized: with an implicit leading 1, no mantissa value can produce it. Instead, a special encoding of all zeroes represents zero. The sign bit may also be set, giving a positive and a negative zero, but the two compare as equal
- Infinities - infinities are represented by an exponent of all ones and a mantissa of all zeroes. Positive and negative infinity differ only in the sign bit
- Subnormals - we have already discussed them briefly. They represent numbers smaller than the smallest possible normalized number and are encoded with an exponent of all zeroes and a non-zero mantissa. Their purpose is to guarantee gradual underflow, so that, for example, subtracting two distinct floating-point numbers never produces zero
- NaN (Not a Number) - used to represent values that are not real numbers, such as the result of 0/0 (note that dividing a non-zero number by zero yields infinity, not NaN). A NaN can be quiet, representing an indeterminate value such as infinity divided by infinity, or signaling, typically distinguished by a leading zero in the mantissa field and used to flag invalid operations. The sketch after this list exercises each of these special values
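All of these special values can be produced and inspected from portable C. A minimal sketch, assuming a C99 compiler on an IEEE 754 platform, using the classification macros from <math.h>:

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    volatile double zero = 0.0; /* volatile keeps the divisions from being folded away */

    double pos_zero = 0.0, neg_zero = -0.0;
    printf("0.0 == -0.0 ? %d\n", pos_zero == neg_zero); /* prints 1: they compare equal */

    double inf = 1.0 / zero; /* non-zero divided by zero gives infinity */
    printf("isinf(1.0/0.0) = %d\n", isinf(inf));

    double qnan = zero / zero; /* 0/0 gives a quiet NaN */
    printf("NaN == NaN ? %d\n", qnan == qnan); /* prints 0: NaN never equals itself */

    double sub = DBL_MIN / 4.0; /* below the smallest normalized double */
    printf("subnormal? %d\n", fpclassify(sub) == FP_SUBNORMAL);
    return 0;
}
```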
Alright, now that you know how floating-point numbers are represented in hardware, we advise you to learn some cool new tricks, such as Quake's fast inverse square root (sketched below), which is mind-boggling!
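As a teaser, here is a sketch of that trick: the widely circulated Quake III Arena version, rewritten here with memcpy instead of the original's pointer cast to stay within defined C behavior. It only works because of the biased-exponent layout we covered above:

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root, as popularized by Quake III Arena.
   It reinterprets the float's bits as an integer, exploits the biased-exponent
   layout to get a cheap first approximation of 1/sqrt(x), then refines it
   with one Newton-Raphson step. */
float q_rsqrt(float number) {
    float x2 = number * 0.5f, y = number;
    uint32_t i;
    memcpy(&i, &y, sizeof i);      /* read the float's bit pattern */
    i = 0x5f3759df - (i >> 1);     /* the famous magic constant */
    memcpy(&y, &i, sizeof y);      /* back to float: a rough 1/sqrt(x) */
    y = y * (1.5f - (x2 * y * y)); /* one Newton-Raphson iteration */
    return y;
}
```

One Newton-Raphson step is enough to bring the estimate within a fraction of a percent of the true value, which was plenty for a game's lighting calculations.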
© AutosarToday —@LinkedIn