High-Performance Video Hardware in "No Time"
It is hard to make the argument that anything other than high-level synthesis (HLS) should be used when addressing present-day design challenges.
High Definition Television (HDTV), DVD, set top boxes, digital cameras, and digital projectors, to name a few, are all consumer products that share much of the same digital "DNA". To capitalize on this commonality, video chip makers are taking the same core algorithm and re-targeting it to a diverse range of applications. To do this successfully, however, they must overcome a number of challenges. One of the biggest is delivering a range of features and performance levels across process technologies in a timely fashion.
One of the key factors in a chip maker's success in the consumer electronics industry is time to market (TTM). Although this phrase means many things in other markets, for consumer video electronics it means just one thing: the Consumer Electronics Show (CES) held every January. This is the "drop dead" date for chip makers to show new features to their customers, in the hope that their chips make it into consumer products targeted at the holiday season the following December. With first silicon due by the end of September prior to CES, there is little time for late algorithmic changes to make it into the current chip, which means new features must wait to be rolled into the next chip. For fabless startups, this kind of TTM pressure can make or break a company; for large chip makers, it can determine the success or failure of a product line.
With Moore's Law marching steadily on, it becomes increasingly difficult to rapidly deliver the rich set of features needed in these video chips while remaining competitive on cost, performance, and flexibility. Because of increased chip capacity, engineers are being asked to create more with less, making traditional RTL design impractical for developing next generation algorithms in hardware. Consequently, designers are looking for new methodologies to help tackle this problem.
Fortunately, high-level C++ synthesis has reached maturity over the past several years, proving itself to be the ideal replacement for these outdated design practices. It allows designers to quickly implement complex features to differentiate themselves from the competition. Using high-level synthesis (HLS), designers can easily incorporate last-minute algorithmic changes before code freeze or tapeout, or reuse core designs by targeting them to multiple applications and features.
Practical Example: A Video Scaler Application
HLS has become successful because of the increasing challenges of meeting TTM requirements for even relatively simple designs using traditional hand coded RTL methods. A good example of this is the "video scaler", which converts one video format to another (Figure 1), and is found in a wide range of products such as HDTV, DVD, cell phones, STB, etc.
Figure 1. Effects of video scaling.
The video scaler typically consists of two parts: a vertical scaler, followed by a horizontal scaler. When used together, these two blocks can stretch or shrink video or image data. Looking inside these blocks reveals that the core algorithm for both of them is a 4- or 5-tap polyphase filter, which is implemented using a finite impulse response (FIR) filter with multiple sets of selectable coefficients. The polyphase FIR can be separated into four parts: interface, memory architecture, coefficient selection, and multiply-accumulate (MAC). Although HLS can have a huge impact on all four parts of the polyphase filter, focusing on the issues around designing the MAC gives the best example of how HLS can help designers be significantly more productive.
One of the biggest obstacles that a hardware designer faces when coding RTL by hand is "time". We're not talking about the time it takes to get the product to market, but instead the maximum clock frequency at which the design can operate. Hardware designers have no choice but to consider "time" when hand coding RTL. The reasons for this can be clearly seen by looking at the FIR MAC, designed for maximum throughput and a sufficiently low clock frequency (Figure 2).
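To make the structure concrete, here is a minimal sketch of a 4-tap polyphase FIR in untimed C++. The names (`polyphase_fir`, `COEFFS`), the choice of four phases, the 7-bit coefficient scaling, and the coefficient values themselves are all assumptions made for illustration; they are not taken from a real scaler design.

```cpp
#include <cassert>
#include <cstdint>

constexpr int NUM_TAPS        = 4;  // assumed: 4-tap filter
constexpr int NUM_PHASES      = 4;  // assumed: 4 selectable coefficient sets
constexpr int COEFF_FRAC_BITS = 7;  // assumed scaling: 128 represents a gain of 1.0

// Illustrative coefficients only (hypothetical, not from a real design).
// Each row sums to 128 so that every phase has unity gain.
constexpr int16_t COEFFS[NUM_PHASES][NUM_TAPS] = {
    {  0, 128,   0,  0 },
    { -8, 112,  28, -4 },
    { -8,  72,  72, -8 },
    { -4,  28, 112, -8 },
};

// One output pixel: select a coefficient set by phase, then multiply-accumulate
// over the window of neighboring input pixels.
uint8_t polyphase_fir(const uint8_t window[NUM_TAPS], int phase) {
    assert(phase >= 0 && phase < NUM_PHASES);
    int32_t acc = 0;
    for (int i = 0; i < NUM_TAPS; ++i) {
        acc += window[i] * COEFFS[phase][i];
    }
    if (acc < 0) return 0;                        // clamp negative overshoot
    const int32_t result = acc >> COEFF_FRAC_BITS;  // drop the fractional bits
    return static_cast<uint8_t>(result > 255 ? 255 : result);
}
```

With a window of {10, 20, 30, 40}, phase 0 returns the second pixel unchanged (20), while phase 2 returns 25, the midpoint between the two center pixels.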
Figure 2. Parallel FIR MAC.
In this example, all of the multiplications of filter tap values against the coefficients are performed in parallel and then summed in an adder tree, all within a single clock cycle. There is a minimum clock period (Tmin) below which this circuit will not function, so the design is essentially "locked down" for a specific frequency and process. If the designer is then asked to run the FIR filter twice as fast as the original design, that is, at half the clock period (Tmin/2), they will be forced to manually redesign the filter in RTL to add pipeline registers to the data path (Figure 3).
Figure 3. Parallel FIR MAC with pipelining.
In addition to creating pipeline registers, they may also have to redesign the data path controller to account for the increased latency. This requires days or even weeks of rewriting RTL code, something an RTL designer always faces when retargeting a design to a different clock frequency, process, or throughput requirement.
In contrast, raising the design to a higher level of abstraction allows the HLS tools to automate this change without touching the source code. Although there is still an ongoing debate over the language of choice for HLS, there is universal agreement that the core algorithmic description should be untimed. Using an untimed description allows a design to be retargeted to different process and performance requirements, independent of the clock frequency.
HLS makes this possible by automatically inserting pipelining where needed, based only on the clock frequency and the delays of the individual operators (adders, multipliers, etc.). This is the part of the HLS process known as "scheduling", and it delivers one of the biggest benefits: designers can write code without worrying about "time". So instead of worrying about how to describe in RTL the timing, parallelism, and all the other implementation details of the filter MAC, the hardware designer is left with something as straightforward as:
acc += reg[i] * coeff[i];
Examination of the C++ code for implementing the MAC shows that there is no concept of either timing or parallelism in the expression. The question that arises is how one specifies micro-architectural details, such as how many multipliers and adders are required for a given implementation to meet the design throughput requirements. The answer is simple. HLS allows users to perform what is known as "loop unrolling", which replicates the loop body and schedules the operations in parallel (Figure 4). This gives the designer the ability to explore the available design space by trading off area versus performance.
Figure 4. The effects of loop unrolling.
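As a rough sketch of what this looks like in source, the first function below wraps the MAC expression in an untimed loop, and the second writes out by hand what unrolling by two amounts to. The function names, `NUM_TAPS`, and the data types are assumptions for illustration. In an actual HLS flow the unrolling would be requested through a tool-specific directive or constraint rather than by editing the code, and that mechanism is deliberately not shown here because it varies from tool to tool.

```cpp
#include <cassert>
#include <cstdint>

constexpr int NUM_TAPS = 4;  // assumed tap count (the article mentions 4 or 5)

// Untimed MAC: the source expresses neither clock cycles nor parallelism.
int32_t fir_mac(const uint8_t reg[NUM_TAPS], const int16_t coeff[NUM_TAPS]) {
    assert(reg != nullptr && coeff != nullptr);
    int32_t acc = 0;
    for (int i = 0; i < NUM_TAPS; ++i) {
        acc += reg[i] * coeff[i];
    }
    return acc;
}

// Hand-written equivalent of unrolling the loop by 2: the loop body is
// replicated, so each iteration contains two independent multiplications
// that a scheduler is free to place in parallel on two multipliers.
int32_t fir_mac_unrolled2(const uint8_t reg[NUM_TAPS], const int16_t coeff[NUM_TAPS]) {
    static_assert(NUM_TAPS % 2 == 0, "this sketch assumes an even tap count");
    assert(reg != nullptr && coeff != nullptr);
    int32_t acc = 0;
    for (int i = 0; i < NUM_TAPS; i += 2) {
        acc += reg[i] * coeff[i] + reg[i + 1] * coeff[i + 1];
    }
    return acc;
}
```

Both functions compute the same result; only the amount of work exposed per loop iteration differs, which is the area versus performance trade-off described above.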
Supporting Multiple Interfaces
For most designers, the scenario just described is reason enough to adopt an HLS design methodology. But there are additional benefits. The wide range of possible applications for the video scaler often requires designers to create hardware for different clock frequencies and processes, and also to support many different types of interfaces. It is essential that designers be able to easily retarget their designs to multiple interfaces.
Some HLS tools support what is known as "interface synthesis", which allows designers to specify a timed protocol for each of the top-level design interfaces. The designer is free to choose between wire, bus, streaming with handshake, or memory interfaces, to name a few, while leaving the source description untimed. Perhaps even more important, the designer can tune the internal bandwidth to the interface bandwidth, maximizing performance and minimizing power consumption. This enables the designer, for example, to target the video scaler to 8, 16, and 32-bit interfaces even though the core algorithm operates on 8-bit video data. The designer can simultaneously widen the bit widths of the interfaces while increasing internal parallelism via loop unrolling, allowing the design throughput to be matched to the available bus bandwidth.
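The idea of matching internal parallelism to a wider interface can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: a 32-bit word carrying four 8-bit pixels, and a simple per-pixel gain standing in for the real scaler core. The names `get_pixel` and `scale_word` are hypothetical, and the timed interface protocol itself would come from the tool's interface synthesis, not from this code.

```cpp
#include <cassert>
#include <cstdint>

constexpr int PIXELS_PER_WORD = 4;  // assumed: 32-bit interface, 8-bit pixels

// Extract pixel k (0 = least significant byte) from a packed 32-bit word.
uint8_t get_pixel(uint32_t word, int k) {
    assert(k >= 0 && k < PIXELS_PER_WORD);
    return static_cast<uint8_t>((word >> (8 * k)) & 0xFFu);
}

// Apply a gain with 7 fractional bits (128 == 1.0) to every pixel in the word.
// The iterations are independent of each other, so unrolling this loop by
// PIXELS_PER_WORD lets all four pixels be processed in parallel, keeping the
// core's throughput matched to one full word per transfer.
uint32_t scale_word(uint32_t word, uint8_t gain_q7) {
    uint32_t out = 0;
    for (int k = 0; k < PIXELS_PER_WORD; ++k) {
        uint32_t p = static_cast<uint32_t>(get_pixel(word, k) * gain_q7) >> 7;
        if (p > 255u) p = 255u;  // clamp to the 8-bit pixel range
        out |= p << (8 * k);
    }
    return out;
}
```

Retargeting to a 16-bit interface in this sketch would mean changing `PIXELS_PER_WORD` to 2 and the word type accordingly, while the per-pixel arithmetic stays the same.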
Matching internal and interface bandwidths often requires the designer to consider the memory architecture of the design. In the case of the horizontal scaler, registers can be used as the storage elements since only a few pixels are stored and operated on at any given moment. However, the vertical scaler requires a more complicated memory architecture because it stores multiple lines of pixel data, where each line may contain several thousand pixels. The choice of target technology and performance requirements will also heavily influence the underlying memory architecture.
Working from an abstract language such as C++ allows designers to quickly explore different implementations such as shifting or circulating buffers (Figure 5). Using HLS allows abstract descriptions using arrays or arrays of pointers to be mapped to memories. Furthermore, designers can automatically "split" their memories by bit-width or word boundary by applying constraints, enabling them to build place-and-route friendly RTL.
Figure 5. Memory Architecture.
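The two buffer styles can be sketched as small C++ structs. The names `ShiftBuffer` and `CircularBuffer` and the depth of four are assumptions for illustration. As a general rule of thumb rather than a claim about any particular tool, a shifting buffer tends to suit register storage because every element moves on each write, while a circulating buffer writes a single location per sample, which is usually a better fit when the storage is a RAM.

```cpp
#include <cassert>
#include <cstdint>

constexpr int DEPTH = 4;  // assumed buffer depth

// Shifting buffer: every element moves by one position on each write.
struct ShiftBuffer {
    uint8_t data[DEPTH] = {};
    void push(uint8_t x) {
        for (int i = DEPTH - 1; i > 0; --i) data[i] = data[i - 1];
        data[0] = x;
    }
    // tap(0) is the newest sample, tap(DEPTH - 1) the oldest.
    uint8_t tap(int i) const {
        assert(i >= 0 && i < DEPTH);
        return data[i];
    }
};

// Circulating buffer: only one element is written per sample; a moving
// index tracks where the newest sample lives.
struct CircularBuffer {
    uint8_t data[DEPTH] = {};
    int head = 0;  // index of the newest sample
    void push(uint8_t x) {
        head = (head + DEPTH - 1) % DEPTH;  // step backwards, wrapping around
        data[head] = x;
    }
    // Same tap ordering as ShiftBuffer, so the two are interchangeable.
    uint8_t tap(int i) const {
        assert(i >= 0 && i < DEPTH);
        return data[(head + i) % DEPTH];
    }
};
```

Because both structs expose the same `push` and `tap` operations, the filter code that reads the taps does not need to change when the buffer style is swapped.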
The Benefits of Fixed-Point Data Types
The discussion of the benefits of HLS would not be complete without talking about the support of fixed-point data types. For years hardware designers have struggled with hand-coding RTL for fixed-point arithmetic. The traditional approach was to manually convert fixed-point operations to integer by left-shifting the data, performing the arithmetic, and then converting the result back to fixed point by right-shifting by the appropriate amount. But this multi-step manipulation of the binary point is error-prone and a frequent source of bugs in a design.
Structures such as the polyphase filter are naturally represented using fixed-point data types, where filter coefficients often contain both integer and fractional parts to maintain unity gain. Fixed-point data types also allow the coefficient selection calculation to be done in fixed-point arithmetic. The selection of the polyphase coefficient is based on the ratio of input image size to output image size, which typically results in fractional data. When up-scaling video, this fractional value is between 0 and 1 and can be used to calculate where to interpolate between the closest 4 neighboring pixels (Figure 6). The most significant bits of the fractional data can be used to generate the address into the coefficient table, which contains weighted values based on the interpolation point.
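A minimal sketch of that traditional shift-based approach, written in C++ for readability, shows where the bookkeeping comes from. The names (`fixed_t`, `to_fixed`, `fixed_mul`, `to_int`) and the choice of 8 fractional bits are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr int FRAC_BITS = 8;  // assumed number of fractional bits
using fixed_t = int32_t;      // integer that the designer must remember is scaled by 2^FRAC_BITS

// "Left shift": scale a real value up into the integer representation.
constexpr fixed_t to_fixed(double x) {
    return static_cast<fixed_t>(x * (1 << FRAC_BITS));
}

// The product of two scaled values carries 2 * FRAC_BITS fractional bits,
// so it must be shifted right once to return to the common format.
// Forgetting this shift, or shifting by the wrong amount, is the classic bug.
fixed_t fixed_mul(fixed_t a, fixed_t b) {
    const int64_t wide = static_cast<int64_t>(a) * b;
    const int64_t shifted = wide >> FRAC_BITS;  // arithmetic shift assumed for negative values
    assert(shifted >= INT32_MIN && shifted <= INT32_MAX);  // caller must avoid overflow
    return static_cast<fixed_t>(shifted);
}

// "Right shift": drop the fractional bits to recover an integer (truncating).
int32_t to_int(fixed_t a) {
    return a >> FRAC_BITS;
}
```

For example, scaling a pixel value of 200 by a coefficient of 0.75 goes through `to_fixed`, `fixed_mul`, and `to_int` to produce 150, with the designer responsible for tracking the binary point at every step.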
Figure 6. Nearest-neighbor coefficient selection
Attempting the coefficient selection calculation in integer arithmetic becomes more complicated as the number of filter phases changes. This affects the required number of integer bits as well as how far the data must be shifted to account for the binary point. Using fixed-point data types, however, allows the most significant bits to always be used for the coefficient selection without having to shift the data, simplifying the design.
The other primary benefit of designing with fixed-point data types is the integrated handling of saturation and rounding. Saturation prevents an arithmetic operation from exceeding a threshold, typically the maximum allowable value for that data type. Rounding reduces the number of significant digits in the resulting assignment. Both of these operations can be expressed as an arithmetic expression in C++, typically resulting in two's complement hardware.
By having saturation and rounding built into the fixed-point data type, the two's complement hardware can be replaced with AND/OR logic, resulting in a significant reduction in area. Furthermore, fixed-point data types can support a number of saturation and rounding modes that are selectable via C++ template parameters, allowing the designer to explore the effects of normal versus symmetrical saturation or rounding to plus or minus infinity.
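The coefficient selection described above can be sketched with plain integers standing in for a fixed-point type. Everything named here (`scale_step`, `sample_point`, `SamplePoint`, 16 fractional bits, 8 phases) is an assumption for illustration. Note that the explicit shifts and masks in this sketch are exactly the bookkeeping that a true fixed-point data type would hide from the designer.

```cpp
#include <cassert>
#include <cstdint>

constexpr int FRAC_BITS  = 16;  // assumed fractional precision of the position
constexpr int PHASE_BITS = 3;   // assumed: 2^3 = 8 filter phases

struct SamplePoint {
    int index;  // integer part: which input pixel the window is anchored on
    int phase;  // top PHASE_BITS of the fraction: which coefficient set to use
};

// Ratio of input size to output size, held with FRAC_BITS fractional bits.
uint32_t scale_step(int in_size, int out_size) {
    assert(in_size > 0 && out_size > 0);
    return static_cast<uint32_t>((static_cast<uint64_t>(in_size) << FRAC_BITS) / out_size);
}

// Position of output pixel out_x in input coordinates, split into the
// integer pixel index and the coefficient-table address.
SamplePoint sample_point(int out_x, uint32_t step) {
    assert(out_x >= 0);
    const uint64_t pos  = static_cast<uint64_t>(out_x) * step;
    const uint32_t frac = static_cast<uint32_t>(pos & ((1u << FRAC_BITS) - 1u));
    return { static_cast<int>(pos >> FRAC_BITS),
             static_cast<int>(frac >> (FRAC_BITS - PHASE_BITS)) };
}
```

For a 2x up-scale (4 input pixels to 8 output pixels) the step is 0.5, so output pixel 3 lands at input position 1.5: index 1, phase 4 of 8. Changing the number of phases in this integer version means revisiting `PHASE_BITS` and every shift that depends on it.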
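For comparison, here is a sketch of rounding and saturation written out as explicit arithmetic, the style that a built-in fixed-point type is meant to replace. The function name, the 8 fractional bits, and the 8-bit unsigned output range are assumptions for illustration, and only one rounding mode (round half up) and one saturation mode are shown.

```cpp
#include <cassert>
#include <cstdint>

constexpr int     FRAC_BITS = 8;                     // assumed fractional bits in the accumulator
constexpr int32_t HALF_LSB  = 1 << (FRAC_BITS - 1);  // 0.5 in the accumulator's format

// Convert a fixed-point accumulator to an 8-bit pixel.
// Rounding: add half an LSB, then truncate (ties round toward +infinity).
// Saturation: clamp the result to the 0..255 range of the output type.
uint8_t round_and_saturate(int32_t acc) {
    assert(acc <= INT32_MAX - HALF_LSB);  // the rounding add itself must not overflow
    if (acc < 0) return 0;                // any negative value rounds to 0 or below
    const int32_t rounded = (acc + HALF_LSB) >> FRAC_BITS;
    if (rounded > 255) return 255;        // saturate at the maximum pixel value
    return static_cast<uint8_t>(rounded);
}
```

Each mode the designer wants to try (symmetrical saturation, rounding toward minus infinity, and so on) would need its own hand-written variant of this function, whereas a fixed-point type selects the mode through its parameters.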
Taking the "Time" Out of Time to Market
Any one of the challenges of designing the video scaler (polyphase FIR filter) in and of itself is justification for moving to an HLS design methodology. But when taken all together, it is hard to make the argument that anything other than HLS should be used when addressing these present-day design challenges.
The abstractness of C++ coupled with built-in fixed-point arithmetic support enables designers to easily implement complex algorithms and quickly roll last minute changes into production ready RTL. Then, interface synthesis combined with loop unrolling gives them the flexibility to retarget a single design to multiple applications and performance requirements. All of this combined with "scheduling" means that TTM doesn't mean much anymore when designing high performance video hardware, in "No Time."
Mike Fingeroff has worked as a technical marketing engineer at Mentor Graphics since 2001 with his primary focus being on high-level synthesis. His areas of interest include DSP and high-performance video hardware. Prior to working for Mentor Graphics, Mike worked as a hardware design engineer developing real-time broadband video systems. Mike received both his bachelor's and master's degrees in electrical engineering from Temple University in 1990 and 1995 respectively.