What is IEEE 754?
Learn how your computer handles floating-point operations and how to take advantage of it
I have a simple but often overlooked question for you: how does a computer, or any modern microcontroller, represent floating-point numbers in hardware? That's right, a question you have probably never asked yourself, but one of extreme importance! Today, we will learn how IEEE 754, the standard for floating-point arithmetic, defines floating-point numbers, and what advantages led IEEE to choose this representation over the alternatives.
Firstly, let's see what the standard is all about. It defines:
- Arithmetic Formats - binary and decimal floating-point numbers, which can be finite numbers (including signed zero and subnormal numbers, which are numbers smaller than the smallest number in normalized form and help avoid underflows), infinities, and NaNs (Not a Number)
- Interchange Formats - encodings that can be used to exchange floating-point data efficiently and in a compact way
- Rounding Rules - conditions to be fulfilled when rounding numbers while performing conversions and arithmetic operations
- Operations - arithmetic and other operations on the arithmetic formats
- Exception Handling - edge cases, such as division by zero and underflow (see the sketch after this list)
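That last point is easy to observe from C. Here is a minimal sketch, assuming a C99 compiler and an IEEE 754-conforming platform (note that the FENV_ACCESS pragma is ignored by some compilers, such as GCC), that raises and tests the divide-by-zero exception flag through <fenv.h>:

```c
#include <stdio.h>
#include <fenv.h>

/* Tell the compiler we inspect the floating-point environment.
   Some compilers ignore this pragma; the example still works in practice. */
#pragma STDC FENV_ACCESS ON

int main(void) {
    volatile double zero = 0.0;   /* volatile keeps the compiler from folding the division */
    feclearexcept(FE_ALL_EXCEPT); /* start with a clean set of exception flags */

    volatile double inf = 1.0 / zero; /* raises the divide-by-zero exception flag */

    if (fetestexcept(FE_DIVBYZERO)) {
        printf("FE_DIVBYZERO was raised, result = %f\n", inf);
    }
    return 0;
}
```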
IEEE 754 is not the only way to represent floating-point numbers, but over time it has proven to be the most efficient. A number in this format is composed of 3 parts:
- Sign - the sign of the number, defining it as positive or negative
- Exponent - represents both positive and negative exponents. The stored value is the actual exponent plus a bias, so the bias must be subtracted from it to recover the actual exponent
- Mantissa - the fractional part of the number, which contains an implicit leading 1 for normalized numbers (the decoding sketch below shows how the three fields fit together)
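To make these three fields concrete, here is a minimal sketch, assuming float is the IEEE 754 binary32 format (true on virtually all modern hardware), that unpacks a float into its sign, exponent, and mantissa:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = -6.25f; /* example: -1.5625 * 2^2, so sign 1, stored exponent 129 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret the float's bytes as an integer */

    uint32_t sign     = bits >> 31;          /* 1 bit  */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits, biased by 127 */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits, implicit leading 1 not stored */

    printf("sign = %u, stored exponent = %u (actual %d), mantissa = 0x%06X\n",
           (unsigned)sign, (unsigned)exponent, (int)exponent - 127, (unsigned)mantissa);
    /* Reconstruct the value as: (-1)^sign * (1 + mantissa/2^23) * 2^(exponent - 127) */
    return 0;
}
```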
When it comes to floating-point numbers, we are used to the float and double types. The differences start with the size, 32 bits for a float and 64 bits for a double, and continue through the field widths:
| Type   | Sign  | Exponent | Mantissa | Bias |
|--------|-------|----------|----------|------|
| float  | 1 bit | 8 bits   | 23 bits  | 127  |
| double | 1 bit | 11 bits  | 52 bits  | 1023 |
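You can cross-check these widths against what your compiler reports using the standard <float.h> constants; on an IEEE 754 platform, FLT_MANT_DIG is 24 rather than 23 because it counts the implicit leading 1:

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* *_MANT_DIG counts the implicit leading 1, so it is stored mantissa bits + 1. */
    printf("float : %zu bytes, %d mantissa digits\n", sizeof(float),  FLT_MANT_DIG);
    printf("double: %zu bytes, %d mantissa digits\n", sizeof(double), DBL_MANT_DIG);
    /* On an IEEE 754 platform this prints 4/24 and 8/53, matching the table above. */
    return 0;
}
```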
Lastly, let's take a look at the special values. These are values that need some extra care because of their nature. They are:
- Zero - zero can't be normalized: with an implicit leading 1, no mantissa value can produce it. Instead, a special encoding of all zeroes represents zero. The sign bit may also be set, giving a positive and a negative zero, but the two compare as equal
- Infinities - infinities are represented by an exponent of all ones and a mantissa of all zeroes. Positive and negative infinity differ only in the sign bit
- Subnormals - we have already discussed them briefly. They represent numbers smaller than the smallest possible normalized number and are encoded with an exponent of all zeroes and a non-zero mantissa. Their purpose is to guarantee gradual underflow, so that, for example, subtracting two distinct floating-point numbers never produces zero
- NaN (Not a Number) - used to represent values that are not real numbers, such as the result of 0/0 (note that dividing a non-zero number by zero yields infinity, not NaN). A NaN can be quiet, representing an indeterminate value such as infinity divided by infinity, or signaling, typically distinguished by a leading zero in the mantissa field and used to flag invalid operations. The sketch after this list exercises each of these special values
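All of these special values can be produced and inspected from portable C. A minimal sketch, assuming a C99 compiler on an IEEE 754 platform, using the classification macros from <math.h>:

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void) {
    volatile double zero = 0.0; /* volatile keeps the divisions from being folded away */

    double pos_zero = 0.0, neg_zero = -0.0;
    printf("0.0 == -0.0 ? %d\n", pos_zero == neg_zero); /* prints 1: they compare equal */

    double inf = 1.0 / zero; /* non-zero divided by zero gives infinity */
    printf("isinf(1.0/0.0) = %d\n", isinf(inf));

    double qnan = zero / zero; /* 0/0 gives a quiet NaN */
    printf("NaN == NaN ? %d\n", qnan == qnan); /* prints 0: NaN never equals itself */

    double sub = DBL_MIN / 4.0; /* below the smallest normalized double */
    printf("subnormal? %d\n", fpclassify(sub) == FP_SUBNORMAL);
    return 0;
}
```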
Alright, now that you know how floating-point numbers are represented in hardware, we advise you to learn some cool new tricks, such as Quake's fast inverse square root (sketched below), which is mind-boggling!
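As a teaser, here is a sketch of that trick: the widely circulated Quake III Arena version, rewritten here with memcpy instead of the original's pointer cast to stay within defined C behavior. It only works because of the biased-exponent layout we covered above:

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root, as popularized by Quake III Arena.
   It reinterprets the float's bits as an integer, exploits the biased-exponent
   layout to get a cheap first approximation of 1/sqrt(x), then refines it
   with one Newton-Raphson step. */
float q_rsqrt(float number) {
    float x2 = number * 0.5f, y = number;
    uint32_t i;
    memcpy(&i, &y, sizeof i);      /* read the float's bit pattern */
    i = 0x5f3759df - (i >> 1);     /* the famous magic constant */
    memcpy(&y, &i, sizeof y);      /* back to float: a rough 1/sqrt(x) */
    y = y * (1.5f - (x2 * y * y)); /* one Newton-Raphson iteration */
    return y;
}
```

One Newton-Raphson step is enough to bring the estimate within a fraction of a percent of the true value, which was plenty for a game's lighting calculations.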
© AutosarToday —@LinkedIn