Handout 16a

IEEE Arithmetic Model


This section describes the IEEE 754 specification.

What Is IEEE Arithmetic?

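IEEE 754 specifies floating-point formats, accuracy requirements for the basic arithmetic operations and for conversions, rounding directions, and floating-point exceptions.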

The IEEE standard also recommends support for user handling of exceptions.

The features required by the IEEE standard make it possible to support interval arithmetic, the retrospective diagnosis of anomalies, efficient implementations of standard elementary functions like exp and cos, multiple precision arithmetic, and many other tools that are useful in numerical computation.

IEEE 754 floating-point arithmetic offers users greater control over computation than does any other kind of floating-point arithmetic. The IEEE standard simplifies the task of writing numerically sophisticated, portable programs not only by imposing rigorous requirements on conforming implementations, but also by allowing such implementations to provide refinements and enhancements to the standard itself.

IEEE Formats

This section describes how floating-point data is stored in memory. It summarizes the precisions and ranges of the different IEEE storage formats.

Storage Formats

A floating-point format is a data structure specifying the fields that comprise a floating-point numeral, the layout of those fields, and their arithmetic interpretation. A floating-point storage format specifies how a floating-point format is stored in memory. The IEEE standard defines the formats, but it leaves to implementors the choice of storage formats.

Assembly language software sometimes relies on using the storage formats, but higher level languages usually deal only with the linguistic notions of floating-point data types. These types have different names in different high-level languages, and correspond to the IEEE formats as shown in TABLE 2-1.

TABLE 2-1   IEEE Formats and Language Types

IEEE Precision     C, C++         Fortran
single             float          REAL or REAL*4
double             double         DOUBLE PRECISION or REAL*8
double extended    long double    REAL*16 [SPARC only]

IEEE 754 specifies exactly the single and double floating-point formats, and it defines a class of extended formats for each of these two basic formats. The long double and REAL*16 types shown in TABLE 2-1 refer to one of the class of double extended formats defined by the IEEE standard.

The following sections describe in detail each of the storage formats used for the IEEE floating-point formats on SPARC and x86 platforms.

Single Format

The IEEE single format consists of three fields: a 23-bit fraction, f; an 8-bit biased exponent, e; and a 1-bit sign, s. These fields are stored contiguously in one 32-bit word, as shown in FIGURE 2-1. Bits 0:22 contain the 23-bit fraction, f, with bit 0 being the least significant bit of the fraction and bit 22 being the most significant; bits 23:30 contain the 8-bit biased exponent, e, with bit 23 being the least significant bit of the biased exponent and bit 30 being the most significant; and the highest-order bit 31 contains the sign bit, s.


FIGURE 2-1   Single-Storage Format

TABLE 2-2 shows the correspondence between the values of the three constituent fields s, e, and f, on the one hand, and the value represented by the single-format bit pattern on the other; u means don't care, that is, the value of the indicated field is irrelevant to the determination of the value of the particular bit pattern in single format.

TABLE 2-2   Values Represented by Bit Patterns in IEEE Single Format  

Single-Format Bit Pattern                                   Value
0 < e < 255                                                 (-1)^s × 2^(e-127) × 1.f (normal numbers)
e = 0; f ≠ 0 (at least one bit in f is nonzero)             (-1)^s × 2^(-126) × 0.f (subnormal numbers)
e = 0; f = 0 (all bits in f are zero)                       (-1)^s × 0.0 (signed zero)
s = 0; e = 255; f = 0 (all bits in f are zero)              +INF (positive infinity)
s = 1; e = 255; f = 0 (all bits in f are zero)              -INF (negative infinity)
s = u; e = 255; f ≠ 0 (at least one bit in f is nonzero)    NaN (Not-a-Number)

Notice that when e < 255, the value assigned to the single format bit pattern is formed by inserting the binary radix point immediately to the left of the fraction's most significant bit, and inserting an implicit bit immediately to the left of the binary point, thus representing in binary positional notation a mixed number (whole number plus fraction, wherein 0 <= fraction < 1).

The mixed number thus formed is called the single-format significand. The implicit bit is so named because its value is not explicitly given in the single-format bit pattern, but is implied by the value of the biased exponent field.

For the single format, the difference between a normal number and a subnormal number is that the leading bit of the significand (the bit to the left of the binary point) of a normal number is 1, whereas the leading bit of the significand of a subnormal number is 0. Single-format subnormal numbers were called single-format denormalized numbers in IEEE Standard 754.

The 23-bit fraction combined with the implicit leading significand bit provides 24 bits of precision in single-format normal numbers.
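The field layout and the rules of TABLE 2-2 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names decode_single and single_from_bits are hypothetical, and the standard struct module is used just to cross-check the hand-computed value against the platform's own interpretation of the same 32 bits.

```python
import struct

def decode_single(bits):
    """Split a 32-bit pattern into (s, e, f) and evaluate it per TABLE 2-2."""
    s = (bits >> 31) & 0x1      # bit 31: sign
    e = (bits >> 23) & 0xFF     # bits 23:30: biased exponent
    f = bits & 0x7FFFFF         # bits 0:22: fraction
    sign = (-1.0) ** s
    if e == 255:
        # All-ones exponent: infinity if f == 0, otherwise NaN.
        value = float("nan") if f else sign * float("inf")
    elif e == 0:
        # Zero or subnormal: implicit bit is 0, exponent fixed at -126.
        value = sign * 2.0 ** -126 * (f / 2.0 ** 23)
    else:
        # Normal number: implicit bit is 1.
        value = sign * 2.0 ** (e - 127) * (1 + f / 2.0 ** 23)
    return s, e, f, value

def single_from_bits(bits):
    """Let struct reinterpret the same 32 bits as an IEEE single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

for pattern in (0x3F800000, 0x40000000, 0x00800000, 0x00000001, 0x80000000):
    s, e, f, value = decode_single(pattern)
    print(f"{pattern:08x}: s={s} e={e:3d} f={f:06x} value={value!r}")
```

For every finite pattern, the value computed from the three fields should equal the value struct reports for the same bits.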

Examples of important bit patterns in the single-storage format are shown in TABLE 2-3. The maximum positive normal number is the largest finite number representable in IEEE single format. The minimum positive subnormal number is the smallest positive number representable in IEEE single format. The minimum positive normal number is often referred to as the underflow threshold. (The decimal values for the maximum and minimum normal and subnormal numbers are approximate; they are correct to the number of figures shown.)

TABLE 2-3   Bit Patterns in Single-Storage Format and their IEEE Values

Common Name                          Bit Pattern (Hex)    Decimal Value
+0                                   00000000             0.0
-0                                   80000000             -0.0
1                                    3f800000             1.0
2                                    40000000             2.0
maximum normal number                7f7fffff             3.40282347e+38
minimum positive normal number       00800000             1.17549435e-38
maximum subnormal number             007fffff             1.17549421e-38
minimum positive subnormal number    00000001             1.40129846e-45
+∞                                   7f800000             Infinity
-∞                                   ff800000             -Infinity
Not-a-Number                         7fc00000             NaN

A NaN (Not a Number) can be represented with any of the many bit patterns that satisfy the definition of a NaN. The hex value of the NaN shown in TABLE 2-3 is just one of the many bit patterns that can be used to represent a NaN.
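The entries of TABLE 2-3 can be checked by reinterpreting each hex pattern as a single-format number. The following sketch assumes Python's struct module; the helper name single_from_hex and the extra NaN patterns listed are illustrative choices, not part of the table.

```python
import math
import struct

def single_from_hex(hex_digits):
    """Interpret 8 hex digits as an IEEE single bit pattern (sign bit first)."""
    return struct.unpack(">f", bytes.fromhex(hex_digits))[0]

table = [
    ("maximum normal number", "7f7fffff"),
    ("minimum positive normal number", "00800000"),
    ("maximum subnormal number", "007fffff"),
    ("minimum positive subnormal number", "00000001"),
]
for name, pattern in table:
    print(f"{name:35s} {pattern}  {single_from_hex(pattern):.8e}")

# Any pattern with e = 255 and a nonzero fraction is a NaN.
nan_patterns = ["7fc00000", "7f800001", "7fa00000", "ffffffff"]
print(all(math.isnan(single_from_hex(p)) for p in nan_patterns))
```

The last line illustrates that the NaN in the table is only one of many: changing the sign bit or any fraction bit (while keeping the fraction nonzero) still yields a NaN.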

Double Format

The IEEE double format consists of three fields: a 52-bit fraction, f; an 11-bit biased exponent, e; and a 1-bit sign, s. These fields are stored contiguously in two successively addressed 32-bit words, as shown in FIGURE 2-2.

In the SPARC architecture, the higher address 32-bit word contains the least significant 32 bits of the fraction, while in the x86 architecture the lower address 32-bit word contains the least significant 32 bits of the fraction.

If we denote by f[31:0] the least significant 32 bits of the fraction, then bit 0 is the least significant bit of the entire fraction and bit 31 is the most significant of the 32 least significant fraction bits.

In the other 32-bit word, bits 0:19 contain the 20 most significant bits of the fraction, f[51:32], with bit 0 being the least significant of these 20 most significant fraction bits, and bit 19 being the most significant bit of the entire fraction; bits 20:30 contain the 11-bit biased exponent, e, with bit 20 being the least significant bit of the biased exponent and bit 30 being the most significant; and the highest-order bit 31 contains the sign bit, s.

FIGURE 2-2 numbers the bits as though the two contiguous 32-bit words were one 64-bit word in which bits 0:51 store the 52-bit fraction, f; bits 52:62 store the 11-bit biased exponent, e; and bit 63 stores the sign bit, s.


FIGURE 2-2   Double-Storage Format
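The difference in word order between the two architectures can be imitated with explicit byte orders. In this sketch, ">" (big-endian) stands in for SPARC and "<" (little-endian) stands in for x86; the helper name words_by_address is hypothetical.

```python
import struct

def words_by_address(x, byte_order):
    """Return the two 32-bit words of a double, lower-addressed word first."""
    raw = struct.pack(byte_order + "d", x)          # 8 bytes in memory order
    return struct.unpack(byte_order + "II", raw)    # split into two words

x = 0.1  # bit pattern 3fb99999 9999999a
sparc_like = words_by_address(x, ">")
x86_like = words_by_address(x, "<")

print("big-endian   :", " ".join(f"{w:08x}" for w in sparc_like))
print("little-endian:", " ".join(f"{w:08x}" for w in x86_like))
```

In the big-endian layout the word holding the sign, exponent, and f[51:32] comes first in memory; in the little-endian layout the word holding f[31:0] comes first.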

The values of the bit patterns in these three fields determine the value represented by the overall bit pattern.

TABLE 2-4 shows the correspondence between the values of the bits in the three constituent fields, on the one hand, and the value represented by the double-format bit pattern on the other; u means don't care, because the value of the indicated field is irrelevant to the determination of value for the particular bit pattern in double format.

TABLE 2-4   Values Represented by Bit Patterns in IEEE Double Format

Double-Format Bit Pattern                                    Value
0 < e < 2047                                                 (-1)^s × 2^(e-1023) × 1.f (normal numbers)
e = 0; f ≠ 0 (at least one bit in f is nonzero)              (-1)^s × 2^(-1022) × 0.f (subnormal numbers)
e = 0; f = 0 (all bits in f are zero)                        (-1)^s × 0.0 (signed zero)
s = 0; e = 2047; f = 0 (all bits in f are zero)              +INF (positive infinity)
s = 1; e = 2047; f = 0 (all bits in f are zero)              -INF (negative infinity)
s = u; e = 2047; f ≠ 0 (at least one bit in f is nonzero)    NaN (Not-a-Number)

Notice that when e < 2047, the value assigned to the double-format bit pattern is formed by inserting the binary radix point immediately to the left of the fraction's most significant bit, and inserting an implicit bit immediately to the left of the binary point. The number thus formed is called the significand. The implicit bit is so named because its value is not explicitly given in the double-format bit pattern, but is implied by the value of the biased exponent field.

For the double format, the difference between a normal number and a subnormal number is that the leading bit of the significand (the bit to the left of the binary point) of a normal number is 1, whereas the leading bit of the significand of a subnormal number is 0. Double-format subnormal numbers were called double-format denormalized numbers in IEEE Standard 754.

The 52-bit fraction combined with the implicit leading significand bit provides 53 bits of precision in double-format normal numbers.
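A short sketch can make the three double-format fields and the 53-bit precision concrete. It assumes that Python's float is an IEEE double, which holds on common platforms but is not guaranteed by the language; the helper name double_fields is hypothetical.

```python
import struct

def double_fields(x):
    """Return (s, e, f) for a double, using the 64-bit numbering of FIGURE 2-2."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63                 # bit 63: sign
    e = (bits >> 52) & 0x7FF       # bits 52:62: biased exponent
    f = bits & ((1 << 52) - 1)     # bits 0:51: fraction
    return s, e, f

print(double_fields(1.0))
print(double_fields(-2.0))

# With 53 significant bits, every integer up to 2**53 is exact,
# but 2**53 + 1 needs 54 bits and is rounded to a neighbor.
print(float(2 ** 53 - 1) == float(2 ** 53))
print(float(2 ** 53 + 1) == float(2 ** 53))
```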

Examples of important bit patterns in the double-storage format are shown in TABLE 2-5. The bit patterns in the second column appear as two 8-digit hexadecimal numbers. For the SPARC architecture, the left one is the value of the lower addressed 32-bit word, and the right one is the value of the higher addressed 32-bit word, while for the x86 architecture, the left one is the higher addressed word, and the right one is the lower addressed word. The maximum positive normal number is the largest finite number representable in the IEEE double format. The minimum positive subnormal number is the smallest positive number representable in IEEE double format. The minimum positive normal number is often referred to as the underflow threshold. (The decimal values for the maximum and minimum normal and subnormal numbers are approximate; they are correct to the number of figures shown.)

TABLE 2-5   Bit Patterns in Double-Storage Format and their IEEE Values  

Common Name                       Bit Pattern (Hex)     Decimal Value
+0                                00000000 00000000     0.0
-0                                80000000 00000000     -0.0
1                                 3ff00000 00000000     1.0
2                                 40000000 00000000     2.0
max normal number                 7fefffff ffffffff     1.7976931348623157e+308
min positive normal number        00100000 00000000     2.2250738585072014e-308
max subnormal number              000fffff ffffffff     2.2250738585072009e-308
min positive subnormal number     00000000 00000001     4.9406564584124654e-324
+∞                                7ff00000 00000000     Infinity
-∞                                fff00000 00000000     -Infinity
Not-a-Number                      7ff80000 00000000     NaN

A NaN (Not a Number) can be represented by any of the many bit patterns that satisfy the definition of NaN. The hex value of the NaN shown in TABLE 2-5 is just one of the many bit patterns that can be used to represent a NaN.
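As a cross-check of TABLE 2-5, the sketch below reads each pair of hex words (most significant word on the left, as printed in the table) and compares the result with the limits Python reports in sys.float_info. It assumes that Python's float is an IEEE double; the helper name double_from_hex is hypothetical.

```python
import math
import struct
import sys

def double_from_hex(hex_words):
    """Interpret two 8-digit hex words, most significant first, as a double."""
    return struct.unpack(">d", bytes.fromhex(hex_words))[0]

max_normal = double_from_hex("7fefffff ffffffff")
min_normal = double_from_hex("00100000 00000000")
max_subnormal = double_from_hex("000fffff ffffffff")
min_subnormal = double_from_hex("00000000 00000001")

print(max_normal == sys.float_info.max)
print(min_normal == sys.float_info.min)
print(min_subnormal == math.ldexp(1.0, -1074))
print(math.isnan(double_from_hex("7ff80000 00000000")))
```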

Ranges and Precisions in Decimal Representation

This section covers the notions of range and precision for a given storage format. It includes the ranges and precisions corresponding to the IEEE single and double formats and to the implementations of IEEE double-extended format on SPARC and x86 architectures. For concreteness, in defining the notions of range and precision we refer to the IEEE single format.

The IEEE standard specifies that 32 bits be used to represent a floating point number in single format. Because there are only finitely many combinations of 32 zeroes and ones, only finitely many numbers can be represented by 32 bits.

One natural question is:

What are the decimal representations of the largest and smallest positive numbers that can be represented in this particular format?

Rephrase the question and introduce the notion of range:

What is the range, in decimal notation, of numbers that can be represented by the IEEE single format?

Taking into account the precise definition of IEEE single format, one can prove that the range of floating-point numbers that can be represented in IEEE single format (if restricted to positive normalized numbers) is as follows:

1.175... × 10^-38 to 3.402... × 10^+38
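The two endpoints follow directly from the single-format definition: the smallest positive normal number has e = 1 and f = 0, and the largest has e = 254 and all fraction bits set. A minimal sketch of that arithmetic (the helper name to_single is hypothetical; struct's "f" format is used to confirm that both values survive storage in single format unchanged):

```python
import struct

def to_single(x):
    """Round a Python float to the nearest IEEE single and return it."""
    return struct.unpack("f", struct.pack("f", x))[0]

min_normal = 2.0 ** (1 - 127)                         # 1.0 × 2^-126
max_normal = (2.0 - 2.0 ** -23) * 2.0 ** (254 - 127)  # 1.11...1 × 2^127

print("%.8e" % min_normal)
print("%.8e" % max_normal)
print(to_single(min_normal) == min_normal, to_single(max_normal) == max_normal)
```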

A second question refers to the precision (not to be confused with the accuracy or the number of significant digits) of the numbers represented in a given format. These notions are explained by looking at some pictures and examples.

The IEEE standard for binary floating-point arithmetic specifies the set of numerical values representable in the single format. Remember that this set of numerical values is described as a set of binary floating-point numbers. The fraction of the IEEE single format has 23 bits, which, together with the implicit leading bit, yield 24 digits (bits) of (binary) precision.

One obtains a different set of numerical values by marking the numbers:

x = (x_1.x_2 x_3 ... x_q) × 10^n

(representable by q decimal digits in the significand) on the number line.

FIGURE 2-5 exemplifies this situation:


FIGURE 2-5   Comparison of a Set of Numbers Defined by Digital and Binary Representation

Notice that the two sets are different. Therefore, estimating the number of significant decimal digits corresponding to 24 significant binary digits requires reformulating the problem.

Reformulate the problem in terms of converting floating-point numbers between binary representations (the internal format used by the computer) and the decimal format (the format users are usually interested in). In fact, you may want to convert from decimal to binary and back to decimal, as well as convert from binary to decimal and back to binary.

It is important to notice that because the sets of numbers are different, conversions are in general inexact. If done correctly, converting a number from one set to a number in the other set results in choosing one of the two neighboring numbers from the second set (which one specifically is a question related to rounding).

Consider some examples. Suppose one is trying to represent a number with the following decimal representation in IEEE single format:

x = x_1.x_2 x_3 ... × 10^n

Because there are only finitely many real numbers that can be represented exactly in IEEE single format, and not all numbers of the above form are among them, in general it will be impossible to represent such numbers exactly. For example, let

y = 838861.2, z = 1.3

and run the following Fortran program:

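C     Assign decimal constants to single-precision (REAL) variables
C     and print each with 12 significant digits.
      REAL Y, Z
      Y = 838861.2
      Z = 1.3
      WRITE(*,40) Y
 40   FORMAT("y: ",1PE18.11)
      WRITE(*,50) Z
 50   FORMAT("z: ",1PE18.11)
      END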

The output from this program should be similar to:

y:	 8.38861187500E+05

z:	 1.29999995232E+00

The difference between the value 8.388612 × 10^5 assigned to y and the value printed out is 0.0125 (that is, 0.000000125 × 10^5), which is seven decimal orders of magnitude smaller than y. In other words, the accuracy of representing y in IEEE single format is about 6 to 7 significant digits; y has about six significant digits if it is to be represented in IEEE single format.

Similarly, the difference between the value 1.3 assigned to z and the value printed out is 0.00000004768, which is eight decimal orders of magnitude smaller than z. The accuracy of representing z in IEEE single format is about 7 to 8 significant digits; z has about seven significant digits if it is to be represented in IEEE single format.
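The same experiment can be repeated without a Fortran compiler. The sketch below is an approximation of the program above, not a translation of it: it assumes Python's struct "f" format rounds to the nearest IEEE single, and it measures the error against the double-precision value of each decimal constant. The helper name to_single is hypothetical.

```python
import struct

def to_single(x):
    """Round a Python float to the nearest IEEE single and return it."""
    return struct.unpack("f", struct.pack("f", x))[0]

for name, decimal in (("y", 838861.2), ("z", 1.3)):
    stored = to_single(decimal)
    error = abs(decimal - stored)
    print(f"{name}: {stored:.11E}   error: {error:.3E}")
```

The stored values should match the Fortran output, and the printed errors correspond to the differences discussed above.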

Now formulate the question:

Assume you convert a decimal floating point number a to its IEEE single format binary representation b, and then translate b back to a decimal number c; how many orders of magnitude are between a and a - c?

Rephrase the question:

What is the number of significant decimal digits of a in the IEEE single format representation, or how many decimal digits are to be trusted as accurate when one represents a in IEEE single format?

The number of significant decimal digits is always between 6 and 9, that is, at least 6 digits, but not more than 9 digits are accurate (with the exception of cases when the conversions are exact, when infinitely many digits could be accurate).

Conversely, if you convert a binary number in IEEE single format to a decimal number, and then convert it back to binary, generally, you need to use at least 9 decimal digits to ensure that after these two conversions you obtain the number you started from.
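Both directions can be tried on a sample of values. The sketch below is illustrative only: the sample ranges are arbitrary choices, the helper names are hypothetical, and single-format rounding is emulated with struct's "f" format. It checks that 6-digit decimals survive a trip through single format, and counts how many single-format numbers fail to survive a trip through 8-digit and 9-digit decimal strings.

```python
import struct

def to_single(x):
    """Round a Python float to the nearest IEEE single and return it."""
    return struct.unpack("f", struct.pack("f", x))[0]

def survives_decimal(x, digits):
    """Binary -> decimal (given significant digits) -> binary round trip."""
    text = "%.*e" % (digits - 1, x)
    return to_single(float(text)) == x

# Decimal -> binary -> decimal with 6 significant digits.
six_digit_ok = True
for n in range(100000, 1000000, 997):
    text = "%.5e" % (n / 1e5)
    six_digit_ok = six_digit_ok and ("%.5e" % to_single(float(text)) == text)

# 1000 consecutive single-format numbers starting at 10.0 (spacing 2^-20).
singles = [10.0 + k * 2.0 ** -20 for k in range(1000)]
failures_8 = sum(not survives_decimal(x, 8) for x in singles)
failures_9 = sum(not survives_decimal(x, 9) for x in singles)

print("6-digit decimals preserved:", six_digit_ok)
print("failures with 8 digits:", failures_8)
print("failures with 9 digits:", failures_9)
```

In the sampled interval the single-format numbers are spaced more closely than 8-digit decimals, so some of them must collide on the same 8-digit string; with 9 digits no collisions occur.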

The complete picture is given in TABLE 2-10:

TABLE 2-10   Range and Precision of Storage Formats

Format                     Significant Digits    Smallest Positive      Largest Positive       Significant Digits
                           (Binary)              Normal Number          Number                 (Decimal)
single                     24                    1.175... × 10^-38      3.402... × 10^+38      6-9
double                     53                    2.225... × 10^-308     1.797... × 10^+308     15-17
double extended (SPARC)    113                   3.362... × 10^-4932    1.189... × 10^+4932    33-36
double extended (x86)      64                    3.362... × 10^-4932    1.189... × 10^+4932    18-21
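The decimal digit counts in TABLE 2-10 agree with two well-known bounds for a binary format with p significant bits: floor((p - 1) × log10(2)) decimal digits always survive a decimal-binary-decimal round trip, and ceil(1 + p × log10(2)) decimal digits always suffice for a binary-decimal-binary round trip. The following sketch evaluates both bounds; the function name decimal_digit_bounds is hypothetical.

```python
import math

def decimal_digit_bounds(p):
    """Return (guaranteed digits, sufficient digits) for p binary digits."""
    guaranteed = math.floor((p - 1) * math.log10(2))
    sufficient = math.ceil(1 + p * math.log10(2))
    return guaranteed, sufficient

for name, p in (("single", 24), ("double", 53),
                ("double extended (SPARC)", 113),
                ("double extended (x86)", 64)):
    low, high = decimal_digit_bounds(p)
    print(f"{name:25s} {p:4d} bits  {low}-{high} decimal digits")
```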

Underflow

Underflow occurs, roughly speaking, when the result of an arithmetic operation is so small that it cannot be stored in its intended destination format without suffering a rounding error that is larger than usual.

Underflow Thresholds

TABLE 2-11 shows the underflow thresholds for single and double precision.

TABLE 2-11   Underflow Thresholds  

Destination Precision    Underflow Threshold
single                   smallest normal number      1.17549435e-38
                         largest subnormal number    1.17549421e-38
double                   smallest normal number      2.2250738585072014e-308
                         largest subnormal number    2.2250738585072009e-308

The positive subnormal numbers are those numbers between the smallest normal number and zero. Subtracting two (positive) tiny numbers that are near the smallest normal number might produce a subnormal number. Similarly, dividing the smallest positive normal number by two produces a subnormal result.

The presence of subnormal numbers provides greater precision to floating-point calculations that involve small numbers, although the subnormal numbers themselves have fewer bits of precision than normal numbers. Producing subnormal numbers (rather than returning the answer zero) when the mathematically correct result has magnitude less than the smallest positive normal number is known as gradual underflow.
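Gradual underflow can be observed directly in double precision. The sketch below assumes that Python's float is an IEEE double and that the platform does not run with subnormals disabled; the variable names are illustrative.

```python
import sys

tiny = sys.float_info.min        # smallest positive normal double

half = tiny / 2                  # below the underflow threshold
x = 1.5 * tiny
y = 1.25 * tiny
difference = x - y               # exact result is 0.25 * tiny

print(0.0 < half < tiny)         # a subnormal, not zero
print(difference == 0.25 * tiny) # the small difference is preserved
print((x == y) == (difference == 0.0))
```

With gradual underflow, x - y is zero only when x equals y, a property that programs often rely on implicitly.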

There are several other ways to deal with such underflow results. One way, common in the past, was to flush those results to zero. This method is known as Store 0 and was the default on most mainframes before the advent of the IEEE Standard.
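For contrast, Store 0 can be modeled in software. The function store_zero below is a hypothetical model written for this illustration, not an interface of any real system: it replaces every result smaller in magnitude than the smallest normal number with a zero of the same sign.

```python
import math
import sys

TINY = sys.float_info.min  # smallest positive normal double

def store_zero(result):
    """Hypothetical Store 0 model: flush subnormal results to signed zero."""
    if 0.0 < abs(result) < TINY:
        return math.copysign(0.0, result)
    return result

x = 1.5 * TINY
y = 1.25 * TINY

gradual = x - y                # kept as a subnormal number
flushed = store_zero(x - y)    # replaced by zero

print(x != y, gradual != 0.0, flushed != 0.0)
```

Under this model two distinct numbers can have a computed difference of zero, which is the kind of anomaly gradual underflow avoids.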

The mathematicians and computer designers who drafted IEEE Standard 754 considered several alternatives while balancing the desire for a mathematically robust solution with the need to create a standard that could be implemented efficiently.


*portions from here.