Published in the December 2006/January 2007 issue of Chip Design Magazine

ESL Synthesis + Power Analysis = Optimal Micro-Architecture

A new power-aware design methodology emphasizes the rapid, early exploration of different micro-architectures before locking onto a particular implementation.
The consumer market is currently experiencing an explosive increase in the use of handheld devices like cell phones, personal digital assistants (PDAs), MP3 players, Global Positioning System (GPS) receivers, digital cameras, and more. The buyers of these products are demanding ever-more sophisticated and changing feature sets. Such feature sets, in turn, require tremendous amounts of additional computing resources to satisfy the demands of a broad and rapidly evolving wireless/mobile-communications market. At the same time, these products must be small, lightweight, and feature-rich with high performance and extremely long battery life.

Similarly, there's an ongoing increase in the use of electronics in the home. Such electronics range from game consoles to DVD/VCR players, digital media recorders, and cable and satellite television set-top boxes. In addition, evolving classes of products like Internet Integrated Access Devices (IADs) are expected to form the central component linking all household voice communications and audio functions. Aside from being relatively cheap and feature-rich, these devices must support extremely high bandwidths and provide extreme computational throughput. Although they're under less pressure than handheld, battery-powered products, these devices are required to consume as little power as possible.

One of the main considerations in creating a low-power design is selecting the most appropriate micro-architecture. A micro-architecture comprises the choice of state and processing elements and how data is steered among them. Unfortunately, the tradeoffs between power, performance, and silicon area aren't intuitive. This article introduces the main concepts associated with creating low-power designs. It then reviews the design of a low-power IEEE 802.11a wireless transmitter. The design was based on a new power-aware design methodology, which emphasizes the rapid, early exploration of different micro-architectures before locking onto a particular architecture.

LOW-POWER DESIGN
In the case of application-specific integrated circuit (ASIC), application-specific standard part (ASSP), and system-on-a-chip (SoC) designs, a number of sophisticated techniques are typically considered in order to reduce power consumption. These techniques include:

  • Multiple voltage islands (Multi-Vdd): Individual functional blocks are run at different supply voltages.

  • Multiple switching thresholds (Multi-Vt): Individual logic gates are formed from transistors with low switching thresholds (faster with higher leakage) or high switching thresholds (slower with lower leakage).

  • Dynamic voltage frequency scaling (DVFS): Here, different portions of the device are dynamically set to run at different voltages and/or frequencies on the fly while the chip is running.

  • Clock gating: Portions of the clock tree(s) that aren't being used at any particular time are disabled.

  • Power gating: When they're not in use, selected functional blocks are individually powered down.

  • Memory splitting: If the software and/or data are persistent in one portion of a memory but not in another, it may be appropriate to split that block of memory into two or more portions. One may then selectively power down those portions that aren't in use.

These techniques may be used individually or in conjunction with each other. All of them can dramatically reduce the power consumption associated with a design. Yet these techniques are only as effective as the micro-architecture to which they're applied. If a design's micro-architecture inherently consumes a lot of power, for example, the power savings achieved using the above techniques may not be sufficient to meet the specification. In contrast, say that the design's micro-architecture is optimal in terms of power (in the context of the appropriate power-performance-area tradeoff). The additional power savings achieved using the above techniques would then be much more meaningful.
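As a rough illustration of why the voltage- and frequency-based techniques listed above pay off, the sketch below uses the textbook first-order model of CMOS switching power, P = a * C * Vdd^2 * f. The model, the function name, and every numeric value are assumptions introduced here for illustration; none of them come from the design discussed in this article, and leakage power is ignored.

```python
def dynamic_power_w(activity, capacitance_f, vdd_v, freq_hz):
    """First-order CMOS switching power: activity * C * Vdd^2 * f (watts)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical operating point, chosen only to make the arithmetic concrete.
baseline = dynamic_power_w(activity=0.2, capacitance_f=1e-9, vdd_v=1.2, freq_hz=100e6)

# Halving the clock frequency halves switching power in this model.
half_clock = dynamic_power_w(activity=0.2, capacitance_f=1e-9, vdd_v=1.2, freq_hz=50e6)

# Lowering the supply from 1.2 V to 1.0 V scales power by (1.0 / 1.2) ** 2.
lower_vdd = dynamic_power_w(activity=0.2, capacitance_f=1e-9, vdd_v=1.0, freq_hz=100e6)

print(f"baseline   : {baseline * 1e3:.2f} mW")
print(f"half clock : {half_clock * 1e3:.2f} mW")
print(f"lower Vdd  : {lower_vdd * 1e3:.2f} mW")
```

Because supply voltage enters quadratically, a modest voltage reduction can save more than a comparable frequency reduction, which is one reason voltage islands and DVFS are usually considered together.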

The problem is that capturing a micro-architecture at the traditional register-transfer level (RTL) using Verilog, SystemVerilog, SystemC, or VHDL is time-consuming. In turn, this means that modifying the RTL to perform a series of "what-if" evaluations of alternative micro-architectures is extremely time-consuming and error-prone. When using a conventional design flow, the end result is usually that the design team is narrowly limited in the number and scope of evaluations that can be performed. This scenario can result in a non-optimal implementation. An approach is needed to quickly and easily explore different micro-architectures in order to determine the most appropriate architecture for the application.

IEEE 802.11A: A SHORT TUTORIAL
IEEE 802.11a is a well-known standard for wireless communications. This protocol translates raw bits into orthogonal-frequency-division-multiplexed (OFDM) symbols. Each of those symbols comprises a set of 64 complex numbers, each represented as a 32-bit, fixed-width value. The 802.11a transmitter can be decomposed into a number of separate, well-defined functional blocks (see Figure 1).

Figure 1: Here is a high-level block diagram of an IEEE 802.11a transmitter.

The controller receives packets from the media-access-control (MAC) layer as a stream of data. It makes sure that each part of the data stream comprises a single packet that has the correct control annotations. In addition, the controller is responsible for creating a header packet associated with each data packet.

The scrambler XORs each data packet with a pseudo-random pattern of bits. This pattern is concisely described at 1 bit per cycle using a 7-bit shift register and two XOR gates. A natural extension of this design is to unroll the loop so that it operates on multiple bits per cycle. The shift register is reset to its initial value for each packet.
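The 1-bit-per-cycle scrambler can be sketched in a few lines of Python. This is a software model only, not the BSV source of the design described here. The generator polynomial x^7 + x^4 + 1 is taken from the 802.11a standard rather than from this article, and the function name and the all-ones default seed are assumptions made for illustration.

```python
def scramble(bits, seed=0b1111111):
    """Scramble (or descramble) a list of 0/1 values, one bit per 'cycle'.

    state[0] models register x1 and state[6] models register x7.
    """
    state = [(seed >> i) & 1 for i in range(7)]
    out = []
    for bit in bits:
        feedback = state[6] ^ state[3]   # first XOR gate: x7 ^ x4
        out.append(bit ^ feedback)       # second XOR gate: data ^ feedback
        state = [feedback] + state[:6]   # advance the 7-bit shift register
    return out


packet = [1, 0, 1, 1, 0, 0, 1, 0]
scrambled = scramble(packet)
# The same operation with the same seed restores the original packet.
restored = scramble(scrambled)
print(scrambled, restored == packet)
```

Unrolling the loop, as the text suggests, corresponds to computing several iterations of the `for` body in a single clock cycle; calling `scramble` afresh for each packet models the per-packet reset.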

The convolutional encoder generates 2 bits of output for every input bit that it receives. Similar to the scrambler, this function can be described concisely at 1 bit per cycle with a shift register and a few XOR gates. Again, unrolling the loop is an obvious and natural parameterization.
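A minimal software model of such an encoder is sketched below. It assumes the rate-1/2, constraint-length-7 code with generator polynomials 133 and 171 (octal) specified by the 802.11a standard; those constants, like the function names, are not given in this article and should be checked against the standard before reuse.

```python
G0 = 0o133  # generator for output A (assumed from the 802.11a standard)
G1 = 0o171  # generator for output B (assumed from the 802.11a standard)


def parity(value):
    """Return the XOR of all bits in a non-negative integer."""
    return bin(value).count("1") & 1


def conv_encode(bits):
    """Emit two output bits per input bit, one input bit per 'cycle'."""
    window = 0  # 7-bit window: newest input in bit 6, oldest in bit 0
    out = []
    for bit in bits:
        window = (bit << 6) | (window >> 1)  # shift the new bit in
        out.append(parity(window & G0))      # output A
        out.append(parity(window & G1))      # output B
    return out


print(conv_encode([1, 0, 0, 0, 0, 0, 0]))
```

Feeding in a single 1 followed by zeros, as in the usage line, reads the two generator tap patterns back out on the interleaved A and B outputs, which is a quick sanity check on the tap positions.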

The interleaver operates on the OFDM symbol in block sizes of 48, 96, or 192 bits, depending on which rate is being used. It reorders the bits in a single packet. The mapper also operates at the OFDM symbol level. It takes the interleaved data and translates it directly into the 64 complex numbers that represent different frequency "tones."
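The reordering performed by the interleaver can be modeled as a fixed permutation of bit positions. The sketch below uses the two-step permutation as the 802.11a standard defines it, with 1, 2, or 4 coded bits per subcarrier giving the 48-, 96-, and 192-bit block sizes mentioned above. The formulas and the modulation mapping are quoted from the standard, not from this article, and the function names are hypothetical; treat this as a sketch to be checked against the specification.

```python
def interleave_indices(n_cbps, n_bpsc):
    """Return dest, where input bit k is written to output position dest[k].

    n_cbps: coded bits per OFDM symbol (48, 96, or 192 here)
    n_bpsc: coded bits per subcarrier (1, 2, or 4 here)
    """
    s = max(n_bpsc // 2, 1)
    dest = []
    for k in range(n_cbps):
        # First permutation: spread adjacent bits across subcarriers.
        i = (n_cbps // 16) * (k % 16) + k // 16
        # Second permutation: alternate bit significance within a subcarrier.
        j = s * (i // s) + (i + n_cbps - (16 * i) // n_cbps) % s
        dest.append(j)
    return dest


def interleave(bits, n_bpsc):
    """Reorder one symbol's worth of bits (len(bits) must be 48 * n_bpsc)."""
    n_cbps = 48 * n_bpsc
    if len(bits) != n_cbps:
        raise ValueError("expected %d bits, got %d" % (n_cbps, len(bits)))
    out = [0] * n_cbps
    for k, j in enumerate(interleave_indices(n_cbps, n_bpsc)):
        out[j] = bits[k]
    return out


print(interleave_indices(48, 1)[:4])
```

Because the permutation depends only on the block size, hardware can implement it as fixed wiring or a small address generator rather than as arithmetic performed on every bit.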

The IFFT performs a 64-point Inverse Fast Fourier Transform (IFFT) on the complex frequencies. Its goal is to translate them into the time domain for transmission. The cyclic extender extends the Inverse Fast Fourier Transformed symbol by appending the beginning and end of the message to the full message body.

Note that an 802.11a transmitter typically features a Puncturer function as well. Yet this example design only implements the lowest three data rates of the 802.11a specification (6, 12, and 24 Mbits/s). At these rates, the Puncturer doesn't perform any operations on the data. It has therefore been omitted from this design example and discussions.

MICRO-ARCHITECTURAL OPTIONS
The first stage in the project was to capture an initial representation of the design. To facilitate the exploration of different micro-architectures, it was necessary to raise the level of abstraction significantly above that of traditional Verilog/VHDL-based RTL. Thus, the design was captured using Bluespec SystemVerilog (BSV). It augments standard SystemVerilog with rules and rule-based interfaces that support complex concurrency and control across multiple shared resources as well as modules.

Following the initial "base-level" design capture (which took only three days for the entire transmitter), various microarchitectures were explored. Due to the ease of modifying the high-level design description in BSV, the exploration of seven different micro-architectures only required an additional two days. In this article, only the architectural exploration of the IFFT block will be considered.

The core of any IFFT is a "butterfly" sub-module. These sub-modules can be created with two inputs (bfy2), four inputs (bfy4, which is sometimes known as a "dragonfly"), eight inputs (bfy8, which is sometimes known as a "spider"), and so on. An IFFT based on bfy4s requires fewer arithmetic operators than one based on bfy2s. Similarly, an IFFT based on bfy8s requires fewer operators than one based on bfy4s. Larger butterflies constrain where the computation can be partitioned, however, and therefore limit micro-architectural options. For the sake of simplicity, only bfy4-based implementations are considered here.

A combinational implementation of the IFFT requires 48 bfy4s presented in three stages. Each stage comprises 16 butterflies. The output values from each stage are permuted before being passed on to the next stage (see Figure 2). Each bfy4 contains three multipliers, four adders, and four subtractors. All of the numbers are complex. They're represented as two 16-bit, fixed-point quantities. A variety of different implementations are possible. As a simple example, the three stages in the combinational implementation could be pipelined so that all three stages operate in lockstep.

Figure 2: This combinational IFFT implementation was constructed from 48 bfy4s.
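The butterfly and operator counts for the combinational IFFT follow from simple arithmetic, sketched below. The per-butterfly operator counts are the ones stated in the text; the function names are hypothetical.

```python
def ifft_structure(n_points, radix):
    """Return (stages, butterflies per stage, total butterflies)."""
    stages, size = 0, 1
    while size < n_points:
        size *= radix
        stages += 1
    if size != n_points:
        raise ValueError("n_points must be an exact power of radix")
    per_stage = n_points // radix
    return stages, per_stage, stages * per_stage


def operator_totals(n_butterflies, mults=3, adds=4, subs=4):
    """Total complex operators, using the bfy4 counts given in the text."""
    return {
        "multipliers": n_butterflies * mults,
        "adders": n_butterflies * adds,
        "subtractors": n_butterflies * subs,
    }


stages, per_stage, total = ifft_structure(64, 4)
print(stages, per_stage, total)
print(operator_totals(total))
```

For a 64-point transform built from bfy4s this gives 3 stages of 16 butterflies, or 48 in total, which is the starting point for the folding options discussed next.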

As another example, the 16 bfy4s in each stage could be replaced by the following: eight bfy4s used over two cycles, four bfy4s used over four cycles, two bfy4s used over eight cycles, or even one bfy4 used over 16 cycles. As each of the three stages is almost identical, they also could be "folded" so that a single stage was used three times. A "super-folded circular pipeline" implementation could even be created based around only a single bfy4. One might intuitively think that such an implementation would be the most efficient option in terms of power. After all, it would use the fewest gates. Because of the increase in clock frequency needed to achieve the required throughput, however, this is actually the most inefficient micro-architecture in terms of power.
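The clock-frequency penalty of folding can be estimated with a simplified cycle count. The sketch below assumes that one pass through a single bfy4 takes one cycle and ignores control and pipeline overhead, so it gives relative figures only; it does not reproduce the clock frequencies actually used in the project, and the function names are hypothetical.

```python
def cycles_per_ifft(n_bfy4, per_stage=16, stages=3):
    """Cycles for one 64-point IFFT on a folded datapath with n_bfy4 units."""
    if per_stage % n_bfy4 != 0:
        raise ValueError("n_bfy4 must divide the butterflies per stage")
    return stages * (per_stage // n_bfy4)


def relative_clock(n_bfy4, reference_bfy4=16):
    """Clock rate needed relative to the reference, at equal throughput."""
    return cycles_per_ifft(n_bfy4) / cycles_per_ifft(reference_bfy4)


for n in (16, 8, 4, 2, 1):
    print(n, cycles_per_ifft(n), relative_clock(n))
```

Under these assumptions a single-bfy4 datapath needs 48 cycles per transform, a 16-fold increase in clock rate over the 16-bfy4 folded version for the same symbol rate.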

TOOLS AND METHODOLOGY LEVERAGED
In addition to a conventional Verilog simulator, an RTL synthesis engine, and a visualization utility, the new power-aware design flow featured two more key elements: the high-level BSV environment from Bluespec (including the Bluespec Compiler and Bluesim simulator) and PowerTheater from Sequence Design (see Figure 3). The Bluespec tools generate a variety of implementation options with different performance and area characteristics, while PowerTheater provides power estimates for them.

Figure 3: A power-aware design flow is depicted here.

For initial simulation evaluations, the Bluespec simulator is faster than a conventional Verilog simulator. It works at a higher level of abstraction. Having said this, the Verilog generated by the Bluespec compiler can be simulated using a conventional Verilog simulator. The results can be compared to prove that they're cycle-accurate.

Both the Bluesim and Verilog simulators can be used to generate industry-standard value-change-dump (VCD) files. These files can then be analyzed using a visualization tool, such as Debussy from Novas Ltd. Meanwhile, PowerTheater accepts the Verilog file as input. Using the VCD file to determine activity on the signals in the design, it then analyzes and displays the design's power characteristics.
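To make the role of the VCD file concrete, the sketch below counts value changes per signal in a small VCD fragment. It is only an illustration of how switching activity can be read from such a file; it is not how PowerTheater works internally. The parser handles a simplified subset of the format (single-line `$var` declarations, scalar and vector changes), and the function name and sample data are hypothetical.

```python
from collections import Counter


def count_toggles(vcd_text):
    """Count value changes per signal; a vector change counts once."""
    names = {}   # identifier code -> signal name
    last = {}    # identifier code -> most recent value
    toggles = Counter()
    in_header = True
    for line in vcd_text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if in_header:
            if tok[0] == "$var" and len(tok) >= 5:
                names[tok[3]] = tok[4]
            elif tok[0] == "$enddefinitions":
                in_header = False
            continue
        if tok[0].startswith(("#", "$")):   # timestamps and section keywords
            continue
        if tok[0][0] in "bBrR":             # vector or real: "b1010 code"
            if len(tok) < 2:
                continue
            value, code = tok[0][1:], tok[1]
        else:                               # scalar: value char, then code
            value, code = tok[0][0], tok[0][1:]
        if code in last and last[code] != value:
            toggles[names.get(code, code)] += 1
        last[code] = value
    return toggles


SAMPLE_VCD = "\n".join([
    "$timescale 1ns $end",
    "$var wire 1 ! clk $end",
    "$var wire 4 % data $end",
    "$enddefinitions $end",
    "#0", "$dumpvars", "0!", "b0000 %", "$end",
    "#5", "1!",
    "#10", "0!", "b1010 %",
    "#15", "1!",
])
print(dict(count_toggles(SAMPLE_VCD)))
```

A power-analysis tool combines this kind of per-signal activity with the netlist and library data to estimate power under a given stimulus, which is why realistic stimulus matters for the estimates.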

With this flow, multiple micro-architecture implementations can be rapidly generated from a single high-level design. In addition, power estimation can be rapidly run on all options using realistic stimulus for the design. Poor micro-architecture candidates can be weeded out early and with minimal effort. The end result is a design architecture that's optimal in terms of power, area, and performance.

PROJECT RESULTS
Alternative micro-architecture implementations of the IFFT were evaluated in the context of the entire 802.11a transmitter. Seven implementations were evaluated: a purely combinational version, a synchronous pipelined version, and five super-folded pipelined versions with 16, 8, 4, 2, and 1 bfy4 nodes, respectively (see the Table). These explorations were performed in only two days.

Table: Transmitter performance with different IFFT blocks

The most relevant metric in these evaluations is the amount of energy required to process one OFDM symbol, as all of the designs were scaled to produce a symbol every 4 µs. Surprisingly, the 802.11a design using the purely combinational IFFT running at 1 MHz consumed an average of only 3.99 mW. By comparison, the super-folded pipelined version using only a single bfy4 node consumed an average of 34.6 mW, or roughly 8.7X as much power. (Note that all of these power values are for the entire transmitter block.)
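Because every design produces symbols at the same rate, energy per symbol is simply average power multiplied by the symbol period, and the ratio between two designs does not depend on the period at all. The sketch below works through that arithmetic using the two power figures quoted above, whose ratio comes to about 8.7; the function names and the placeholder period are assumptions for illustration.

```python
COMBINATIONAL_MW = 3.99   # average power quoted in the text
SINGLE_BFY4_MW = 34.6     # average power quoted in the text


def energy_per_symbol_nj(avg_power_mw, symbol_period_us):
    """Energy per symbol in nanojoules (1 mW * 1 us = 1 nJ)."""
    return avg_power_mw * symbol_period_us


def energy_ratio(power_a_mw, power_b_mw, symbol_period_us):
    """Ratio of energy per symbol for two designs at the same symbol rate."""
    return (energy_per_symbol_nj(power_a_mw, symbol_period_us)
            / energy_per_symbol_nj(power_b_mw, symbol_period_us))


# Placeholder period: the ratio is the same for any positive value.
ratio = energy_ratio(SINGLE_BFY4_MW, COMBINATIONAL_MW, symbol_period_us=1.0)
print(f"single-bfy4 vs. combinational: {ratio:.2f}x energy per symbol")
```

The same function can be used to rank all seven implementations once their measured power values are substituted for the two constants.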

Once the optimal RTL micro-architecture has been established, based on early power estimation for many implementation candidates, the design team can start to consider refining the design. For this step, additional low-power design techniques are used. First, a power-debug environment like PowerTheater can graphically highlight the blocks of RTL that consume the most power (active and leakage). Cross-referencing to the RTL source code also is performed. Second, automated power linters eliminate wasted power. Popular power-management techniques like multi-Vdd, multi-Vt, dynamic voltage frequency scaling, clock gating, power gating, and memory splitting may then be explored.

All of these power-management techniques have some tradeoff in terms of area, performance, and/or complexity with regard to the chip's physical implementation. Different techniques and combinations of techniques may be more or less appropriate depending on the design itself (datapath-centric versus control-centric, for example). Using a technique inappropriately may consume a lot of engineering and silicon resources while providing little gain. Note that the Verilog generated by the Bluespec compiler also can be used as input to a standard RTL synthesis engine. This will generate the gate-level representation that will ultimately be used to realize the design as an ASIC or FPGA.

One of the main considerations in creating a low-power design is selecting the most appropriate micro-architecture. Unfortunately, the tradeoffs between power, performance, and silicon area aren't intuitive. There is a wide variety of low-power design techniques, such as multi-Vdd, multi-Vt, dynamic voltage frequency scaling, clock gating, power gating, and memory splitting. These techniques may be used individually or in conjunction with each other. They can dramatically reduce the power consumption associated with a design. Yet these techniques are only as effective as the micro-architecture to which they're applied. It is therefore essential to perform micro-architectural exploration to determine the optimal architecture in terms of power, performance, and silicon area before proceeding with the what-if analysis of other power-management techniques.

Capturing, modifying, and evaluating a micro-architecture at the traditional RTL using Verilog, SystemVerilog, SystemC, or VHDL is time-consuming. Using these languages significantly limits the design team as to the number and scope of evaluations that can be performed. A non-optimal implementation can result. In contrast, a power-aware design flow that combines ESL design capture and verification tools with sophisticated RTL/gate-level power analysis can facilitate design exploration. The result will be designs that have micro-architectures that are optimally suited to their target applications in terms of power, performance, and silicon area.
Holly Stump, VP Marketing, Sequence Design Inc., Santa Clara, CA, email: hstump@sequencedesign.com, and George Harper, VP Marketing, Bluespec Inc., Waltham, MA, e-mail: gharper@bluespec.com.

George Harper is Vice President of Bluespec Inc. He brings more than 15 years of marketing and engineering experience from the semiconductor, communications, and storage industries. Harper has a BSEE and MSEE from Stanford University and an MBA from Harvard University.

Holly Stump is Vice President of Marketing for Sequence Design. She has over 20 years of high-tech B2B marketing, business development, channel management, and international sales experience in the EDA, semiconductor, and electronics industries.

Acknowledgments:
The authors wish to thank the developers of the 802.11 design (Nirav Dave, Michael Pellauer, Steve Gerding, and Arvind), who shared their experiences and results from this joint Nokia Research and MIT project.
