Published on February 04th, 2010
What is between fixed point and floating point? In fact, what does it even mean to speak about being “in between” two computational types? Sometimes two concepts seem so fundamentally different that “betweenness” shouldn’t have a meaning. I would argue that there are important reasons why some sets of computing hardware should natively support something in between the two.
Floating point (single and double) is used in all general purpose processors, larger DSPs and GPUs and is invoked wherever float and double are mentioned in a C program. Fixed point is primarily used in ASICs, FPGAs and smaller DSPs as the basic computational method. These two computational types are used for different purposes and each has well-known advantages and disadvantages. Floating point has a much larger dynamic range and constant relative error, while fixed point uses much less hardware and is faster. Also, floating point algorithms are much easier to design and debug, while fixed point algorithms require constant attention to binary point location and word widths.
So, what is the gap between them and how could it be filled? The gap can be looked at in different ways:
In fact, many on-chip computations require some features of each type of computational element and no readily available standard set of formats and hardware support exists. This lack forces an unnatural dichotomy for designers. It forces an embedded designer to choose between a floating point DSP and a fixed point DSP; it forces an ASIC designer to add on floating point cores and external processor support; it forces difficult bit width trade-offs in FPGA designs.
A List of Desirable Features
Let’s look in more detail at some of the computational features desired by embedded designers. These features include:
Numbers 1, 2 and 3 directly conflict and one resolution that has been proposed over the years is to allow the precision to decrease as the size of the numbers grows (or shrinks) outside a “normal” range where most calculations reside. In fact,  shows that numbers involved in calculations typically follow a Gaussian distribution in magnitude. Numbers 3 and 4 conflict as well and have been addressed in DSPs and GPUs for floating point by simplifying from the IEEE 754 standard. These include simplified rounding modes, no denormalized numbers and fewer exception types handled. Number 5, on the other hand, is a “simple” problem, only involving politics among many unorganized entities, competing commercial interests, and intellectual property conflicts.
IEEE 754 Floating Point
Before considering how to meet these needs, it is worth examining the complexity this floating point standard demands ( notes that floating point takes up almost 3X the hardware of fixed-point math). The IEEE 754 standard  specifies many items:
The basic format is divided into a 32-bit single-precision and a 64-bit double-precision data type, as shown in Figure 1. Here the sign-magnitude significand (also called the mantissa) has an implied leading bit that is not stored.
Figure 1. IEEE 754 single (a) and double precision (b).
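The numbers represented by the single-precision format are, for normalized values (0 < E < 255):
x = (–1)^s × 1.f × 2^e
where s is the sign bit, f is the 23-bit stored fraction, E is the stored biased exponent and e is the unbiased exponent with e = E – 127.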
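The field layout just described can be checked with a few lines of Python. This is only an illustrative sketch: the function names `decode_single` and `normalized_value` are made up for this example, and only normalized numbers (0 < E < 255) are handled.

```python
import struct

def decode_single(x):
    """Split a value, rounded to IEEE 754 single precision, into its fields."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31              # 1 sign bit
    E = (bits >> 23) & 0xFF        # 8-bit stored (biased) exponent
    fraction = bits & 0x7FFFFF     # 23-bit stored fraction, leading 1 implied
    e = E - 127                    # unbiased exponent
    return sign, E, e, fraction

def normalized_value(sign, e, fraction):
    """Value of a normalized number: (-1)^sign * 1.f * 2^e."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0**e

print(decode_single(1.0))    # (0, 127, 0, 0)
print(decode_single(-6.5))   # sign 1, E = 129, e = 2, fraction = 0x500000
```

Feeding the decoded fields of -6.5 back into `normalized_value` reproduces -6.5 exactly, since -6.5 = -1.625 × 2^2.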
Five types of exceptions are defined and signaled through status flags: Invalid Operation, Division by Zero, Overflow, Underflow and Inexact. Three extra bits (guard, round and sticky) are used to support the five rounding modes: “Round to nearest even”, “Round to nearest away from zero”, “Round to zero”, “Round up”, and “Round down”.
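The difference between the five rounding modes is easiest to see on halfway cases. The sketch below uses Python's standard `decimal` module as a stand-in: it rounds decimal values to integers rather than binary significands to a fixed width, so it is an analogy for the behavior, not a model of floating-point hardware. The mapping of names in `MODES` and the helper `round_to_integer` are assumptions made for this illustration.

```python
from decimal import (Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP,
                     ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)

# Assumed correspondence between the mode names in the text and decimal's constants.
MODES = {
    "Round to nearest even": ROUND_HALF_EVEN,
    "Round to nearest away from zero": ROUND_HALF_UP,
    "Round to zero": ROUND_DOWN,
    "Round up": ROUND_CEILING,
    "Round down": ROUND_FLOOR,
}

def round_to_integer(text, mode_name):
    """Round a decimal string to an integer under the named rounding mode."""
    return int(Decimal(text).quantize(Decimal("1"), rounding=MODES[mode_name]))

for name in MODES:
    print(name, [round_to_integer(v, name) for v in ("2.5", "-2.5", "2.4")])
```

For the tie at 2.5, the two "nearest" modes disagree (2 versus 3), and for -2.5 the directed modes disagree with each other (-2 for "Round up", -3 for "Round down").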
The FPGA cost for different floating-point bit width formats as estimated in  is shown in Figure 2. Here the 16-bit format uses a 4-bit exponent with an 11-bit fraction, the 32-bit format is the IEEE single precision format, and the 40-bit format uses a 10-bit exponent with a 29-bit fraction. As can be seen, a doubling of the bit width results in almost 3 times the resources used. Thus for an embedded application, it is critically important to use as small a bit width as possible.
Figure 2. The cost in LUTs and flip-flops for floating-point multiplication.
The costs of different optional features for floating-point multipliers as estimated in  are shown in Figure 3. That paper shows there is significant cost to adding gradual underflow (denormalized numbers), since a shifter is required for normalizing the product. Also the normalization units (i.e. barrel shifters) contribute significantly to the cost of the adder. Moreover, inclusion of proper rounding adds 10–15% to the cost, which is similar to the cost of adding rounding in the adder implementation.
Figure 3. Floating-point multiplier implementation cost (scale is logarithmic)
Graphics Processing Units (GPUs) have simplified their floating point implementations as a result. NVIDIA GPUs which are CUDA compatible follow the IEEE-754 standard for 32 bit floating-point arithmetic with the following deviations (see ):
In the case of single-precision floating-point numbers:
Also, some instructions are not IEEE-compliant:
Even with their non-standard implementation, GPUs have been very successful at high performance embedded computations where FPGAs or banks of DSPs would previously have been used (note that the new Fermi architecture addresses some of these deficiencies).
Fixed and Floating Point in VHDL
While GPUs have simplified their floating point implementations so that a large number of floating-point adder/multipliers can fit on a single GPU, FPGAs have instead incorporated a large number of “hard MACs” (typically 18x18 integer multiplier/adders) in addition to the LUTs and flip-flops to address their customers’ computing needs. Custom ASICs rely entirely on the designer (and any IP suppliers), so their computing needs have been addressed only quite recently, with the introduction of fixed point and floating point into the VHDL language (post 2005). However, synthesis support for these language extensions is only now beginning to appear for FPGAs and ASICs. Without widespread synthesis support, the features described below are really little more than convenient simulation tools for designers. The report  describes designers’ usage of computation types as follows:
While these can only be crude estimates of total usage, it would be interesting to compare them against other options if such were readily available. It is worth examining fixed and floating VHDL constructs in some more detail, see .
Fixed Point. The fixed-point math packages are based on the VHDL 1076.3 NUMERIC_STD package. Two new data types “ufixed” (unsigned fixed point) and “sfixed” (signed fixed point) were created with generic definitions as follows:
Here is an example showing usage:
The index range specified in parentheses requests a 14-bit-wide value with 8 bits (7-0) to the left and 6 bits to the right of the binary point. This package also defines 3 constants that are used to modify fixed-point arithmetic behavior:
The "guard_bits" default to "fixed_guard_bits" which defaults to 3, just as is used in standard floating point. Thus the additions to the standard language allow the designer to be much more precise and expressive about fixed point calculations. Nevertheless, when synthesized with an 18x18 MAC instantiated, a number of LUTs must also be used in order to turn a raw integer calculation into a proper fixed point calculation.
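To make the 14-bit example concrete, here is a small Python model of an unsigned fixed-point value with 8 integer bits and 6 fraction bits. It is a sketch, not the VHDL package: the helper names are hypothetical, and the round-half-up and saturate behavior is simply assumed for the illustration. The multiply shows the extra rounding, shifting and clamping that has to surround a raw integer product.

```python
import math

LEFT, RIGHT = 7, -6            # index range, as in "7 downto -6"
WIDTH = LEFT - RIGHT + 1       # 14 bits in total
SCALE = 1 << -RIGHT            # 2**6 = 64 codes per unit
MAX_CODE = (1 << WIDTH) - 1

def to_ufixed(x):
    """Quantize x to an unsigned 14-bit code (round half up, then saturate)."""
    code = math.floor(x * SCALE + 0.5)
    return max(0, min(code, MAX_CODE))

def from_ufixed(code):
    """Real value represented by a code."""
    return code / SCALE

def ufixed_mul(a, b):
    """Multiply two codes and return a code in the same 14-bit format."""
    raw = a * b                              # integer product with 12 fraction bits
    code = (raw + (SCALE >> 1)) >> -RIGHT    # round, then drop 6 fraction bits
    return min(code, MAX_CODE)               # saturate on overflow

a, b = to_ufixed(3.25), to_ufixed(2.5)
print(a, b, from_ufixed(ufixed_mul(a, b)))   # 208 160 8.125
```

Values above the representable range saturate at the largest code, so `to_ufixed(1000.0)` returns 16383, which represents 255.984375.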
Floating Point. IEEE 754 floating point support has also been added to the VHDL language together with extensions that allow modifications from its defaults. Here is the generic definition:
The actual floating-point type is defined as follows:
Thus floating point formats with different bit widths from the standard ones can be chosen, together with options for round_style, denormalize, check_error and guard_bits.
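The effect of choosing non-standard widths can be explored in software before committing to hardware. The following Python sketch rounds a value to a floating-point format with a chosen exponent and fraction width. The function `quantize_float` is hypothetical and does not model the VHDL package: it assumes round to nearest even, a bias of 2^(exponent_width-1) - 1, no denormalized numbers, and clamping to the largest finite value instead of producing infinities.

```python
import math

def quantize_float(x, exponent_width, fraction_width):
    """Round x to a float with the given field widths (normalized values only)."""
    if x == 0.0:
        return 0.0
    bias = (1 << (exponent_width - 1)) - 1
    m, e = math.frexp(abs(x))        # abs(x) = m * 2**e with 0.5 <= m < 1
    m, e = m * 2.0, e - 1            # rescale so that 1 <= m < 2
    steps = 1 << fraction_width
    m = round(m * steps) / steps     # round to nearest, ties to even
    if m == 2.0:                     # rounding carried into the next binade
        m, e = 1.0, e + 1
    # Assumed policy: clamp large values, flush small values to zero.
    e_min, e_max = 1 - bias, (1 << exponent_width) - 2 - bias
    if e > e_max:
        m, e = 2.0 - 1.0 / steps, e_max
    elif e < e_min:
        return 0.0
    return math.copysign(m * 2.0**e, x)

print(quantize_float(math.pi, 8, 23))   # single-precision widths
print(quantize_float(math.pi, 4, 11))   # 16-bit format: 3.1416015625
```

With a 4-bit exponent and 11-bit fraction, pi is stored as 3217/1024, and anything above 255.9375 is clamped under the assumptions stated above.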
Previous Proposals for New Formats
The GPU and VHDL-enabled extensions have not addressed issues like numeric precision and standardization, so let us examine some other proposed solutions for simplified computational formats:
Figure 4. Tapered floating point format.
To encode both G and the exponent, a systematic packing algorithm similar to Huffman coding is used, as shown in Table 1. This packing distributes numbers so that wider mantissas are available for numbers near one in magnitude, and narrower mantissas for numbers that are much smaller or much larger.
Its main disadvantage is that the G field is always present, reducing the number of bits available to the mantissa.
Table 1. Exponent packing for tapered floating point.
Figure 5. Richey and Saiedian 16 bit fractional format.
With only 16 bits, fixed point can have inadequate dynamic range and noise performance, with no FPU available either. The proposed solution has a variable size ones complement mantissa with the widest mantissas at the top of the range, falling off gradually to the narrowest mantissa at the bottom of the range. The encoding of the “exponent” uses dual fields at both ends of the number. Table 2 shows the first set of numeric ranges as proposed in  for use in DSP applications (they propose another set of ranges as well that fit within a 32 bit fractional word). Their paper showed a large improvement in numeric accuracy over the Binary16 floating point format.
The main disadvantages to this approach are that it is one-sided (only non-positive exponents), although that could be fixed with a different exponent assignment, and it is inefficient with bit usage as the exponents get further away from zero.
Table 2. Example Ranges for the dual exponent fields.
An alternative proposal
Meeting the needs of ASIC designers, FPGA designers, DSP engineers and even some GPU graphics algorithms requires a family of specific formats and computational behaviors that covers a range of applications without sacrificing performance when word widths are narrow. For example, whereas IEEE 754 defined only 32 and 64 bit floating point, a much wider range of widths is necessary to cover these application areas, especially in the range of 12-24 bits. This requires a parameterized set of standards, just as has been done in VHDL. Note that for byte alignment reasons, DSPs and GPUs typically require a 16 bit aligned format (also 24 bits for some DSPs). However, word widths for ASICs and FPGAs can of course be any number of bits, though there is a preference for FPGAs to have multiples of 2 or 4 due to the LUT architecture. Native support is also necessary for performance on FPGAs; they would need a “hard MAC++”, an enhanced computational element that uses only a small amount of extra resources.
Taking the best of these various options, we introduce an alternative that has the following features:
with the following additions:
The format shown in Figure 6 has several interesting features. The sign bit is implied as part of the two’s complement mantissa and does not need to be handled separately. Since the sign bit is the leading bit, this format could be tested for sign using normal DSP operations. Also, a conversion to fixed point simply requires one or two arithmetic shifts of the right amount, and an approximate mantissa can be used with no conversion at all (the exponent bits become lower bit “noise”).
Figure 6. Alternative Format.
Normalization is typically implemented using bit shifter circuits such as a barrel shifter shown in Figure 7 and discussed in .
Figure 7. Barrel Shifter Schematic.
An N bit wide barrel shifter with N bit range (sufficient for floating point) requires O(N²) transistors with O(N) input capacitance and O(1) delay, while an N bit logarithmic shifter requires O(N log₂ N) transistors with O(log₂ N) delay and longer interstage routes. While an N bit shift range is still necessary for non-consecutive exponents, it is the number of bit shift positions, not just the range, that affects the size and speed of the shifting circuit. For example, with only N/2 shift positions possible, the barrel shifter is half the size. The effects of a reduced number of bit shift positions and a two’s complement mantissa are severalfold:
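As a side illustration of the shifter trade-off, the Python sketch below models a logarithmic shifter as a cascade of power-of-two stages and gives rough switch counts for both styles. The functions `log_shift_right` and `shifter_cost` are hypothetical, and the counts are order-of-magnitude stand-ins (one switch per bit per position or per stage), not figures for any real process or cell library.

```python
def log_shift_right(value, amount, width):
    """Right-shift by `amount` using one power-of-two stage per bit of `amount`."""
    stages = max(1, (width - 1).bit_length())   # ceil(log2(width)) stages
    value &= (1 << width) - 1
    for k in range(stages):
        if (amount >> k) & 1:      # stage k either shifts by 2**k or passes through
            value >>= 1 << k
    return value, stages

def shifter_cost(width, positions):
    """Illustrative switch counts only: (barrel, logarithmic)."""
    barrel = width * positions                                  # grows as N * positions
    logarithmic = width * max(1, (positions - 1).bit_length())  # grows as N * log2(positions)
    return barrel, logarithmic

print(log_shift_right(0xBEEF, 5, 16))   # same result as 0xBEEF >> 5, in 4 stages
print(shifter_cost(16, 16))             # (256, 64)
print(shifter_cost(16, 8))              # (128, 48): half the barrel shifter
```

Under this simple count, halving the number of shift positions halves the barrel shifter but removes only one stage from the logarithmic one.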
The Elias Gamma code shown in Table 3 allows for a variable length exponent without requiring an exponent width field (it is bit reversed in the format so that trailing zero counting locates the exponent). This code allows ~E/2 exponent bits for E bits in width. Note that larger values of L allow a greater range at a reduced accuracy.
Table 3. Elias Gamma code used as the exponent label.
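For readers unfamiliar with it, the Elias gamma code writes a positive integer n as k zeros followed by the (k + 1)-bit binary form of n, where k = floor(log2 n). A code word of 2k + 1 bits therefore carries k free bits, which is the "~E/2" noted above. The Python sketch below shows encoding and decoding, plus one possible reading of the bit-reversed layout in which counting trailing zeros yields k. The function names and the exact placement of the code in the low bits of the word are assumptions for illustration, not the format's definition.

```python
def elias_gamma_encode(n):
    """Return the Elias gamma code of n >= 1 as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for n >= 1")
    binary = bin(n)[2:]                        # k + 1 bits, starting with a 1
    return "0" * (len(binary) - 1) + binary    # k zeros announce the length

def elias_gamma_decode(bits):
    """Decode one code word from the front of `bits`; return (n, bits_used)."""
    k = 0
    while bits[k] == "0":                      # count the leading zeros
        k += 1
    return int(bits[k:2 * k + 1], 2), 2 * k + 1

def decode_reversed(word):
    """Decode a bit-reversed code word held in the low bits of a nonzero integer."""
    k = (word & -word).bit_length() - 1        # number of trailing zeros
    field = (word >> k) & ((1 << (k + 1)) - 1)
    # The k + 1 payload bits are stored in reverse order, so flip them back.
    return int(format(field, "0{}b".format(k + 1))[::-1], 2)

print([elias_gamma_encode(n) for n in (1, 2, 5, 9)])
print(elias_gamma_decode("00101110"))          # (5, 5)
print(decode_reversed((0b1011 << 5) | 0b01100))
```

In the last line the low five bits hold the reversed code for 6, and the higher bits (standing in for mantissa bits) do not disturb the decoding.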
Finally, Table 4 gives a simple comparison of the four number formats discussed in this article.
Table 4. Comparison between four number formats for a width of 16 bits.
To summarize, the alternative format being discussed has the following important features:
Dr. Gary Ray is a Technical Fellow at Boeing in the Boeing Research and Technology division. He has 25 years of experience in signal, communications and image processing, including several years at Hughes Aircraft and Westinghouse Hanford. He has published over 20 papers and was group lead at both Hughes Aerospace and the Boeing High Technology Center. Gary earned his doctorate from the University of Washington.