Published on February 04th, 2010

Between Fixed and Floating Point

What is between fixed point and floating point? In fact, what does it even mean to speak about being “in between” two computational types? Sometimes two concepts seem so fundamentally different that “betweenness” shouldn’t have a meaning. I would argue that there are important reasons why some sets of computing hardware should natively support something in between the two.

Floating point (single and double) is used in all general purpose processors, larger DSPs and GPUs and is invoked wherever float and double are mentioned in a C program. Fixed point is primarily used in ASICs, FPGAs and smaller DSPs as the basic computational method. These two computational types are used for different purposes and each has well-known advantages and disadvantages. Floating point has a much larger dynamic range and constant relative error, while fixed point uses much less hardware and is faster. Also, floating point algorithms are much easier to design and debug, while fixed point algorithms require constant attention to binary point location and word widths.

So, what is the gap between them and how could it be filled? The gap can be looked at in different ways:

    1. The gap occurs when an on-chip calculation requires high precision and/or dynamic range, but the hardware resources are not sufficient. For example, a floating point solution requires too many floating point units (FPUs) and a fixed point solution requires a datapath that is so wide, it exhausts the routing resources.
    2. The gap is also seen when a fixed point design is required by the hardware, but the designer only has enough time in the schedule for a floating point algorithm design. Of course, most embedded designers would love to have the ability to design all their computational datapaths with “pure” floating point and not worry about the consequences to performance, timing and resources.

In fact, many on-chip computations require some features of each type of computational element and no readily available standard set of formats and hardware support exists. This lack forces an unnatural dichotomy for designers. It forces an embedded designer to choose between a floating point DSP and a fixed point DSP; it forces an ASIC designer to add on floating point cores and external processor support; it forces difficult bit width trade-offs in FPGA designs.

A List of Desirable Features
Let’s look in more detail at some of the computational features desired by embedded designers. These features include:

    1. Large dynamic range. When a fixed point result overflows and wraps around, it is more than just annoying; it is often catastrophic! However, guaranteeing that overflow never happens in fixed point can cost a great deal of precision.
    2. Small data path size. Routing is always a major concern in ASICs and FPGAs, so computational types must keep the data path width to a reasonable size, even at the expense of some precision.
    3. Numeric precision. The calculations need to remain “essentially correct” after going through many rounding/truncation operations.
    4. Small fixed latency for addition and multiplication. Imagine trying to handle variable timing in your pipelined design for each calculation based on the values involved! CPUs must use an entirely different architecture (executing a stored program from main memory) to benefit from variable latency.
    5. Low complexity. Silicon area on an ASIC or resources in an FPGA are limited no matter what salesmen claim. Murphy almost guarantees that your design will overfill whatever resources you have.
    6. Adherence to a widely accepted standard. IEEE 754, the floating point standard, is a prime example in the numeric computing world. Having an equally accepted standard for other computational types would be highly desirable.

Numbers 1, 2 and 3 directly conflict and one resolution that has been proposed over the years is to allow the precision to decrease as the size of the numbers grows (or shrinks) outside a “normal” range where most calculations reside. In fact, [5] shows that numbers involved in calculations typically follow a Gaussian distribution in magnitude. Numbers 3 and 4 conflict as well and have been addressed in DSPs and GPUs for floating point by simplifying from the IEEE 754 standard. These simplifications include simplified rounding modes, no denormalized numbers and fewer exception types handled. Number 6, on the other hand, is a “simple” problem, only involving politics among many unorganized entities, competing commercial interests, and intellectual property conflicts.
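To make the overflow concern in feature 1 concrete, here is a minimal Python sketch of a 16 bit signed addition that either wraps around or saturates. The helper names `wrap16` and `saturate16` are hypothetical and chosen only for this illustration; they are not taken from any of the cited sources.

```python
def wrap16(x: int) -> int:
    """Keep the low 16 bits of x and interpret them as two's complement."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def saturate16(x: int) -> int:
    """Clamp x to the signed 16 bit range instead of letting it wrap."""
    return max(-0x8000, min(0x7FFF, x))


a, b = 30000, 10000            # each operand fits in 16 bits, the sum does not
print(wrap16(a + b))           # wrapped result has the wrong sign
print(saturate16(a + b))       # saturated result sticks at the largest value
```

Pre-scaling both operands (for example by one bit) would avoid the overflow altogether, but only by discarding a low order bit of each, which is the precision loss described above.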

IEEE 754 Floating Point
Before considering how to meet these needs, it is worth examining the complexity this floating point standard demands ([6] notes that floating point takes up almost 3X the hardware of fixed-point math). The IEEE 754 standard [4] specifies many items:

    1. Basic and extended floating-point number formats
    2. Add, subtract, multiply, divide, square root, remainder, and compare operations
    3. Conversions between integer and floating-point formats
    4. Conversions between different floating-point formats
    5. Conversions between basic format floating-point numbers and decimal strings
    6. Floating-point exceptions and rounding

The basic format is divided into a 32 bit single-precision and a 64 bit double-precision data type as shown in Figure 1. Here the sign-magnitude significand (also called the mantissa) has an implied leading bit that is not stored.


Figure 1. IEEE 754 single (a) and double precision (b).

The numbers represented by the single-precision format are:

    x = (-1)^s × 2^e × 1.xx…x         (normalized), when E > 0, else
    x = (-1)^s × 2^(-126) × 0.xx…x    (denormalized)

where E is the stored biased exponent and e is the unbiased exponent with e = E – 127.
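The two cases can be checked numerically. The following Python sketch unpacks the sign s, biased exponent E and 23 stored fraction bits of a single-precision value and rebuilds the number from them. The function name `decode_single` is an assumption made for this example, and the reserved exponent E = 255 (infinities and NaNs) is deliberately not handled.

```python
import struct


def decode_single(x: float):
    """Return (s, E, fraction, value) for x stored as IEEE 754 single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                 # sign bit
    E = (bits >> 23) & 0xFF        # stored, biased exponent
    frac = bits & 0x7FFFFF         # 23 stored fraction bits
    if E == 0xFF:
        raise ValueError("infinity/NaN is outside the scope of this sketch")
    if E > 0:                      # normalized: implied leading 1
        value = (-1) ** s * 2.0 ** (E - 127) * (1 + frac / 2 ** 23)
    else:                          # denormalized: leading 0, fixed exponent
        value = (-1) ** s * 2.0 ** -126 * (frac / 2 ** 23)
    return s, E, frac, value


print(decode_single(1.0))
print(decode_single(-0.15625))
print(decode_single(2.0 ** -149))  # smallest denormalized single
```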

Five types of exceptions are defined and signaled through status flags: Invalid Operation, Division by Zero, Overflow, Underflow and Inexact. Three extra bits (guard, round and sticky) are used to support the five rounding modes: “Round to nearest even”, “Round to nearest away from zero”, “Round to zero”, “Round up”, and “Round down”.
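As a rough illustration of how the five rounding modes differ, the Python sketch below drops the k low order bits of an integer significand under each mode. The function `round_shift` and its mode strings are hypothetical names for this example. Hardware would summarize the discarded bits in a few extra bits rather than keep the full remainder as is done here.

```python
def round_shift(m: int, k: int, mode: str) -> int:
    """Divide the integer m by 2**k (k >= 1), rounding as selected by mode."""
    floor_q, rem = divmod(m, 1 << k)   # floor quotient, 0 <= rem < 2**k
    half = 1 << (k - 1)                # value of exactly one half LSB
    if mode == "down":                 # toward minus infinity
        return floor_q
    if mode == "up":                   # toward plus infinity
        return floor_q + (rem != 0)
    if mode == "toward_zero":
        return floor_q + (rem != 0 and m < 0)
    if mode == "nearest_away":         # ties go away from zero
        return floor_q + (rem > half or (rem == half and m >= 0))
    if mode == "nearest_even":         # ties go to the even neighbour
        return floor_q + (rem > half or (rem == half and floor_q % 2 == 1))
    raise ValueError(f"unknown rounding mode: {mode}")


modes = ["nearest_even", "nearest_away", "toward_zero", "up", "down"]
print([round_shift(5, 1, mode) for mode in modes])    # rounding 2.5
print([round_shift(-5, 1, mode) for mode in modes])   # rounding -2.5
```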

Implementation Costs
The FPGA cost for different floating-point bit width formats as estimated in [10] is shown in Figure 2. Here the 16-bit format uses a 4-bit exponent with an 11-bit fraction, the 32-bit format is the IEEE single precision format, and the 40-bit format uses a 10-bit exponent with a 29-bit fraction. As can be seen, a doubling of the bit width results in almost 3 times the resources used. Thus for an embedded application, it is critically important to use as small a bit width as possible.


Figure 2. The cost in LUTs and flip-flops for floating-point multiplication.

The costs of different optional features for floating-point multipliers as estimated in [10] are shown in Figure 3. That paper shows there is significant cost to adding gradual underflow (denormalized numbers), since a shifter is required for normalizing the product. Also the normalization units (i.e. barrel shifters) contribute significantly to the cost of the adder. Moreover, inclusion of proper rounding adds 10–15% to the cost, which is similar to the cost of adding rounding in the adder implementation.


Figure 3. Floating-point multiplier implementation cost (scale is logarithmic)

GPU Simplifications
Graphics Processing Units (GPUs) have simplified their floating point implementations as a result. NVIDIA GPUs which are CUDA compatible follow the IEEE-754 standard for 32 bit floating-point arithmetic with the following deviations (see [8]):

  • There is no dynamically configurable rounding mode.
  • There is no mechanism for detecting a floating-point exception.
  • Absolute value and negation are not compliant with IEEE-754 with respect to NaNs; these are passed through unchanged.

In the case of single-precision floating-point numbers:

  • Denormalized numbers are not supported.
  • Results which underflow are flushed to zero.
  • The result of an operation involving one or more input NaNs is the quiet NaN.

Also, some instructions are not IEEE-compliant:

  • Addition and multiplication are often combined into a single multiply-add instruction (FMAD), which truncates the intermediate result of the multiplication.
  • For addition and multiplication, only round-to-nearest-even and round-towards-zero are supported via static rounding modes.
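The effect of dropping denormalized numbers can be imitated in software. This Python sketch rounds a value to single precision and then flushes any denormalized result to a signed zero. It only mimics the flush-to-zero behavior listed above and is not a model of any particular GPU; the names `to_single` and `flush_to_zero` are assumptions for this example.

```python
import math
import struct

FLT_MIN = 2.0 ** -126   # smallest normalized single-precision magnitude


def to_single(x: float) -> float:
    """Round a Python float (double precision) to the nearest single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def flush_to_zero(x: float) -> float:
    """Round to single precision, replacing denormalized results by zero."""
    y = to_single(x)
    if 0.0 < abs(y) < FLT_MIN:
        return math.copysign(0.0, y)   # keep the sign of the lost value
    return y


print(to_single(2.0 ** -130))       # representable as a denormalized single
print(flush_to_zero(2.0 ** -130))   # flushed to zero
print(flush_to_zero(2.0 ** -126))   # smallest normalized value is kept
```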

Even with their non-standard implementation, GPUs have been very successful at doing high performance embedded computations where FPGAs or banks of DSPs would previously have been used (note that the new Fermi architecture addresses some of these deficiencies).

Fixed and Floating Point in VHDL
While GPUs have tried to simplify their floating point implementations so that a large number of floating point adder/multipliers could fit on a single GPU, FPGAs have instead incorporated a large number of “hard MACs” (typically 18x18 integer multiplier/adders) in addition to the LUTs and flip-flops to address their customers’ computing needs. Custom ASICs rely entirely on the designer (and any IP suppliers), so the computing needs have been addressed (quite recently) with the introduction of fixed point and floating point into the VHDL language (post 2005). However, synthesis support for these language extensions is only now beginning to be introduced for FPGAs and ASICs. Without widespread synthesis support for the features described below, they are really little more than convenient simulation tools for designers. The report [6] describes designers’ usage of computation types as follows:

    “Designers tend to use math solutions in order of “integer math”, “fixed point math” and “floating point math”, where 80% of designs are done in integer; of the next 20%, 80% of those are done in fixed point.”

While these can only be crude estimates of total usage, it would be interesting to compare them against other options if such were readily available. It is worth examining the fixed and floating point VHDL constructs in more detail; see [7].

Fixed Point. The fixed-point math packages are based on the VHDL 1076.3 NUMERIC_STD package. Two new data types “ufixed” (unsigned fixed point) and “sfixed” (signed fixed point) were created with generic definitions as follows:

    type ufixed is array (integer range <>) of std_logic;
    type sfixed is array (integer range <>) of std_logic;

Here is an example showing usage:

    use ieee.fixed_pkg.all;
    signal a, b : sfixed (7 downto -6);

The binary point specified in parentheses requests a 14 bit wide value with 8 bits (7 down to 0) to the left and 6 bits to the right of the binary point. This package also defines 3 constants that are used to modify fixed-point arithmetic behavior:

    constant fixed_round : boolean := true; -- Round or truncate
    constant fixed_saturate : boolean := true; -- saturate or wrap
    constant fixed_guard_bits : natural := 3; -- guard bits for rounding

The "guard_bits" default to "fixed_guard_bits" which defaults to 3, just as is used in standard floating point. Thus the additions to the standard language allow the designer to be much more precise and expressive about fixed point calculations. Nevertheless, when synthesized with an 18x18 MAC instantiated, a number of LUTs must also be used in order to turn a raw integer calculation into a proper fixed point calculation.
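For readers without a VHDL simulator at hand, the behavior of an sfixed (7 downto -6) word can be approximated in a few lines of Python. This is only a sketch: the functions `to_sfixed` and `from_sfixed` are hypothetical, the `round_` and `saturate` flags merely echo the package constants above, and the rounding used here (round half up) is an assumption that may differ from what the VHDL package does on exact ties.

```python
import math


def to_sfixed(x: float, left: int = 7, right: int = -6,
              round_: bool = True, saturate: bool = True) -> int:
    """Raw two's complement integer holding x in an sfixed(left downto right)."""
    width = left - right + 1                  # 14 bits for (7 downto -6)
    scaled = x * 2 ** -right                  # move the binary point to bit 0
    raw = math.floor(scaled + 0.5) if round_ else math.floor(scaled)
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if saturate:
        return max(lo, min(hi, raw))          # clamp to the representable range
    return (raw - lo) % (1 << width) + lo     # wrap around modulo 2**width


def from_sfixed(raw: int, right: int = -6) -> float:
    """Real value represented by a raw sfixed integer."""
    return raw * 2.0 ** right


print(from_sfixed(to_sfixed(3.14159)))                  # nearest multiple of 2**-6
print(from_sfixed(to_sfixed(200.0)))                    # saturates near +128
print(from_sfixed(to_sfixed(200.0, saturate=False)))    # wraps to a negative value
```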

Floating Point. IEEE 754 floating point support has also been added to the VHDL language together with extensions that allow modifications from its defaults. Here is the generic definition:

    package fphdl32_pkg is new IEEE.fphdl_pkg
    generic map (
    fp_fraction_width => 23, -- 23 bits of fraction
    fp_exponent_width => 8, -- exponent 8 bits
    fp_round_style => round_nearest, -- round nearest algorithm
    fp_denormalize => true, -- Turn on Denormalized numbers
    fp_check_error => true, -- Turn on NAN and overflow processing
    fp_guard_bits => 3); -- number of guard bits

The actual floating-point type is defined as follows:

    type fp is array (fp_exponent_width downto -fp_fraction_width) of STD_LOGIC;

Thus floating point formats with different bit widths from the standard ones can be chosen, together with options for round_style, denormalize, check_error and guard_bits.
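To get a feel for what choosing different widths means, the following Python sketch computes the largest value, the smallest normalized value and the relative spacing for a given exponent and fraction width. It assumes IEEE 754 style conventions (a bias of 2^(w-1) - 1 and the top exponent code reserved for infinities and NaNs), which a custom format need not follow; `format_limits` is a hypothetical helper name.

```python
def format_limits(exponent_width: int, fraction_width: int):
    """Return (largest, smallest_normal, epsilon) for an IEEE-style format."""
    bias = 2 ** (exponent_width - 1) - 1
    e_max = (2 ** exponent_width - 2) - bias   # top code assumed reserved
    e_min = 1 - bias                           # code 0 assumed for denormals
    largest = (2 - 2.0 ** -fraction_width) * 2.0 ** e_max
    smallest_normal = 2.0 ** e_min
    epsilon = 2.0 ** -fraction_width           # spacing of values just above 1
    return largest, smallest_normal, epsilon


for name, (ew, fw) in {"half": (5, 10), "single": (8, 23)}.items():
    print(name, format_limits(ew, fw))
```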

Previous Proposals for New Formats
The GPU and VHDL-enabled extensions have not addressed issues like numeric precision and standardization, so let us examine some other proposed solutions for simplified computational formats:

    1. Binary16. In IEEE 754-2008 a 16-bit floating point format (also called half precision) is specified. It was created by Industrial Light & Magic (ILM) to handle large image dynamic ranges without the memory cost of single precision format, but was not intended for computation. Its format uses 1 sign bit, 5 bits of biased exponent and 10 bits of sign-magnitude significand with implicit leading bit. It is used in several computer graphics standards including OpenGL, Cg, and D3DX with texture hardware support in GPUs. Binary16 has been analyzed in [1] for use in DSP computations. As can be expected for such a short mantissa, its numerical performance is poor.
    2. Tapered floating point. In standard floating point, exponents occupy the same space regardless of their value. Tapered floating point, described in [3], attempts to distribute the storage between the exponent and mantissa. The G, S and W fields are shown in Figure 4. The G field stores the width of the exponent within the W field, which has both exponent E and ones complement mantissa M; the sign bit lies in between.

Figure 4. Tapered floating point format.

To encode both G and the exponent, a systematic packing algorithm similar to Huffman coding is used as shown in Table 1. This packing distributes numbers so that wider mantissas are possible for numbers near one in magnitude and narrower when they are both smaller and larger.

Its main disadvantage is that the G field is always present, reducing the number of bits available to the mantissa.


Table 1. Exponent packing for tapered floating point.

    3. 16 bit fractional. In [1], Richey and Saiedian proposed a 16 bit fractional format for low power 16 bit DSPs using two exponent fields as shown in Figure 5.

Figure 5. Richey and Saiedian 16 bit fractional format.

With only 16 bits, fixed point can have inadequate dynamic range and noise performance, and no FPU is available either. The proposed solution has a variable size ones complement mantissa with the widest mantissas at the top of the range, falling off gradually to the narrowest mantissa at the bottom of the range. The encoding of the “exponent” uses dual fields at both ends of the number. Table 2 shows the first set of numeric ranges as proposed in [1] for use in DSP applications (they propose another set of ranges as well that fit within a 32 bit fractional word). Their paper showed a large improvement in numeric accuracy over the Binary16 floating point format.

The main disadvantages to this approach are that it is one-sided (only non-positive exponents), although that could be fixed with a different exponent assignment, and it is inefficient with bit usage as the exponents get further away from zero.


Table 2. Example Ranges for the dual exponent fields.

An alternative proposal
Meeting the needs of ASIC designers, FPGA designers and DSP engineers, and even some GPU graphics algorithms, requires a family of specific formats and computational behavior that covers a range of different behaviors without sacrificing performance when word widths are narrow. For example, whereas IEEE 754 defined only 32 and 64 bit floating point, a much wider range of widths is necessary to meet these wider application areas, especially in the range of 12-24 bits. This requires a parameterized set of standards, just as has been done in VHDL. Note that for byte alignment reasons, DSPs and GPUs typically require a 16 bit aligned format (also 24 bits for some DSPs). However, word widths for ASICs and FPGAs can of course be any number of bits, though there is a preference for FPGAs to have multiples of 2 or 4 due to the LUT architecture. And native support is necessary for performance on FPGAs; they would need a “hard MAC++” that has an enhanced computational element using only a small amount of extra resources.

Taking the best of these various options, we introduce an alternative that has the following features:

  • No exception handling – addition and multiplication always produce an answer, there is no “infinity”; there is simply a largest and smallest value possible. The behavior of division and square roots is left unspecified.
  • Rounding is optional – the cost of “rounding to nearest even” in floating point is a separate addition with possible bit shift and exponent adjustment, clearly an expensive operation.
  • Efficient usage of bits for narrow formats – our exponent takes up fewer bits when numbers are near one in absolute value.
  • Variable length mantissa and exponent – short exponents for numbers near one allow the mantissa to be longer.

with the following additions:

  • Simplified normalization – non-consecutive exponents are allowed, which can reduce barrel shifter size and delay while keeping dynamic range constant. For example, the notation L=2 denotes that only even exponents (shifts of two) are represented, while L=4 would imply shifts operate at the nibble level.
  • Two’s complement mantissa – allows simple conversion to and from fixed point.
  • An exponent labeled using the Elias Gamma code, see [2].

The format shown in Figure 6 has several interesting features. The sign bit is implied as part of the two’s complement mantissa and does not need to be handled separately. Since the sign bit is the leading bit, this format could be tested for sign using normal DSP operations. Also, a conversion to fixed point simply requires one or two arithmetic shifts of the right amount and an approximate mantissa can be used with no conversion at all (the exponent bits become lower bit “noise”).


Figure 6. Alternative Format.

Normalization is typically implemented using bit shifter circuits such as a barrel shifter shown in Figure 7 and discussed in [9].


Figure 7. Barrel Shifter Schematic.

An N bit wide barrel shifter with N bit range (sufficient for floating point) requires O(N^2) transistors with O(N) input capacitance and O(1) delay, while an N bit logarithmic shifter requires O(N log2 N) transistors with O(log2 N) delay and longer interstage routes. While an N bit shift range is still necessary for non-consecutive exponents, it is the number of bit shift positions that affects the size and speed of the shifting circuit. For example, with only N/2 shift positions possible, the barrel shifter is half the size. The effects of a reduced number of bit shift positions and a two’s complement mantissa are several:

  • There is no concept of a hidden leading bit, the leading bits are simply sign bit(s) of the two’s complement mantissa.
  • Underflow is detected with a “leading equal bit counter” rather than a “leading zero bit counter”.
  • Relative error is the price to be paid for a simplification in normalization; the gain is in flexibility to meet timing and sizing constraints.
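One possible reading of these points is sketched below in Python: a “leading equal bit counter” finds how many bits merely repeat the sign of a two’s complement mantissa, and normalization shifts left by the largest multiple of the exponent step L that fits. The functions `redundant_sign_bits` and `normalize` are hypothetical and are not taken from the format definition; they are only meant to show why a larger L leaves some sign bits in place and therefore costs relative accuracy.

```python
def redundant_sign_bits(m: int, width: int) -> int:
    """Count leading bits below the sign bit that are equal to the sign bit."""
    bits = m & ((1 << width) - 1)          # two's complement bit pattern
    sign = (bits >> (width - 1)) & 1
    count = 0
    for i in range(width - 2, -1, -1):     # scan from just below the sign bit
        if (bits >> i) & 1 != sign:
            break
        count += 1
    return count


def normalize(m: int, width: int, step: int):
    """Shift m left by the largest multiple of step that cannot overflow.

    Returns (shifted mantissa, shift amount)."""
    shift = (redundant_sign_bits(m, width) // step) * step
    return m << shift, shift


for step in (1, 2, 4):                     # L = 1, 2 and 4
    print(step, normalize(3, 16, step), normalize(-3, 16, step))
```

With L = 1 the mantissa 3 is shifted by 13 places, while L = 2 or L = 4 allow only 12, so one potentially useful low order bit position is given up in exchange for fewer shift positions.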

The Elias Gamma code shown in Table 3 allows for a variable length exponent without requiring an exponent width field (it is bit reversed in the format so that trailing zero counting locates the exponent). This code allows ~E/2 exponent bits for E bits in width. Note that larger values of L allow a greater range at a reduced accuracy.


Table 3. Elias Gamma code used as the exponent label.
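For reference, the standard Elias Gamma code from [2] can be written in a few lines of Python. This sketch encodes positive integers as bit strings and decodes them again; how codewords are assigned to actual exponent values, and the bit reversal used inside the format, are not modeled here. The function names are chosen for this example only.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias Gamma codeword for n >= 1: (length - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias Gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str):
    """Decode one codeword from the front of bits; return (n, remaining bits)."""
    zeros = 0
    while bits[zeros] == "0":              # the zero run gives the length
        zeros += 1
    end = 2 * zeros + 1
    return int(bits[zeros:end], 2), bits[end:]


for n in range(1, 6):
    print(n, elias_gamma_encode(n))
print(elias_gamma_decode("00101" + "110"))
```

A codeword for n occupies 2*floor(log2(n)) + 1 bits, so small labels are short, which is what lets exponents near zero leave more room for the mantissa.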

Finally, Table 4 gives a simple comparison of the four number formats discussed in this article.


Table 4. Comparison between four number formats for a width of 16 bits.

To summarize, the alternative format being discussed has the following important features:

  • Width in N bits is variable to cover primary useful sizes of 12, 16, 20 and 24 bits.
  • The exponent has variable width and is attached in the low order bits, rather than the high order bits, easing usage on DSPs and GPUs.
  • Normalization has been simplified using a fixed exponent step size L, tied to the implementation technology being used (DSPs and GPUs), and in the case of an ASIC or FPGA, the dynamic range desired.
  • Addition has been simplified in several ways compared to floating point: no conversion to/from sign-magnitude, simplified shifts for binary point alignment, and optional rounding.


    [1] Manuel Richey and Hossein Saiedian, “A New Class of Floating-Point Data Formats with Applications to 16-Bit Digital-Signal Processing Systems”, IEEE Communications Magazine, July 2009.
    [2] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Inform. Theory, vol. IT-21, no. 2, pp. 194-203, 1975.
    [3] R. Morris, "Tapered floating point: A new floating-point representation," IEEE Trans. Computing, vol. C-20, no. 6, pp. 1578-1579, 1971.
    [4] ANSI/IEEE Std. 754-1985: "IEEE Standard for Binary Floating Point Arithmetic," New York: ANSI/IEEE, 1985.
    [5] R.W. Hamming, “On the Distribution of Numbers”, Bell Syst. Tech. J. Vol 49, Oct. 1970, pp. 1609-1625.
    [6] David Bishop, “Fixed- and floating-point packages for VHDL 2005”, DVCON, 2005.
    [7] David Bishop, “Floating point package user’s guide”.
    [8] NVIDIA CUDA Programming Guide, v2.2.
    [9] K. Acken, M. J. Irwin, R. Owens, “Power Comparisons for Barrel Shifters”, ISLPED 1996, Monterey, CA, 1996.
    [10] Allan Jaenicke and Wayne Luk, “Parameterized Floating Point Arithmetic on FPGAs”, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 897-900, 2001.

Dr. Gary Ray is a Technical Fellow at Boeing in the Boeing Research and Technology division. He has 25 years of experience in signal, communications and image processing, including several years at Hughes Aircraft and Westinghouse Hanford. He has published over 20 papers and was group lead at both Hughes Aerospace and the Boeing High Technology Center. Gary earned his doctorate from the University of Washington.
