High Performance and Low Power Design: Tradeoff or Co-Existence?

The latest EDA innovations address the Quadrangle of Constraints design challenge

It’s no secret that market demands have resulted in more functionality in almost every mobile, consumer and computing product available today. As recently as 10 years ago, few would have thought that streaming television content to a laptop or a cell phone would be practical, yet according to 2012 Olympics data released by NBC, about 28 million people visited its website and watched 29 million video streams on their tablets or cell phones. Another 35 million viewers watched the Olympics on a desktop or laptop.

Video streaming requires very high performance in different applications: bandwidth from networking devices, and processing and graphics capability in the hardware itself. Whether the hardware is mobile or plugs into the wall, power is always a major concern, either to extend battery life or to meet green initiatives that aim to save power. This is just one example of why most system-on-chip devices need to be designed for both high performance and low power.

Synopsys conducts a Global User Survey (GUS) every year, asking respondents to provide detailed data on the key design trends and challenges seen in their projects. Data accumulated over several years shows that more than a third (37%) of all designs today target clock speeds greater than 750MHz, almost double the number from just five years ago (see Figure 1). The respondents’ primary end applications for high-speed designs were data center and networking, digital home, mobile multimedia, and personal computing and peripherals: the very same applications that would be required for the video streaming example cited earlier.

Figure 1: Clock Frequency of Designs

At the same time, more designers today are targeting advanced process geometries, with about half (47%) targeting 32nm and below (see Figure 2).

Figure 2: Process Technology of Current Design

The smaller the process node, the leakier it is, which increases power consumption, so both mobile and non-mobile applications must adopt many advanced power techniques. Narrowing the results of the 2012 Synopsys GUS data to high-performance (> 750 MHz) designs shows that several advanced low power techniques for saving dynamic and leakage (static) power are actively being designed in (see Figure 3).
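The split between dynamic and leakage power can be made concrete with the standard first-order CMOS power model. The sketch below is only an illustration: the function names and every numeric value are hypothetical and do not come from the survey data.

```python
def dynamic_power(activity, cap_farads, vdd, freq_hz):
    """Switching power in watts: activity * C * Vdd^2 * f."""
    return activity * cap_farads * vdd ** 2 * freq_hz


def leakage_power(vdd, leak_current_amps):
    """Static power in watts: Vdd * I_leak, consumed even when nothing toggles."""
    return vdd * leak_current_amps


def total_power(activity, cap_farads, vdd, freq_hz, leak_current_amps):
    """Sum of the dynamic and leakage components."""
    return (dynamic_power(activity, cap_farads, vdd, freq_hz)
            + leakage_power(vdd, leak_current_amps))


if __name__ == "__main__":
    # Hypothetical block: 1 nF switched capacitance, 1.0 V supply,
    # 1 GHz clock, 15% switching activity, 50 mA of leakage current.
    dyn = dynamic_power(0.15, 1e-9, 1.0, 1e9)
    leak = leakage_power(1.0, 0.05)
    print(f"dynamic = {dyn:.3f} W, leakage = {leak:.3f} W")
```

In this model, clock gating lowers the activity term, voltage scaling lowers both terms, and higher-Vt cells lower only the leakage current.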

Figure 3: Advanced LP Techniques Used in High Performance (>750 MHz) Designs

Many of today’s designers will attest that even with the need to eke out the fastest clock speed possible, it is necessary to design for low power as well. NVIDIA is a well-known leader in visual computing technologies and the inventor of the GPU, a high-performance processor that generates high-performance graphics on workstations, personal computers, game consoles and mobile devices. Vikas Agrawal, responsible for physical design methodology development at NVIDIA, states that his primary high-performance design challenge is to meet all of the budgets in “The Quadrangle of Constraints,” which includes:

  • High Frequency (GHz+)
  • Low Power
  • Small Footprint
  • Fast Time to Market

Electronic design automation (EDA) tools help to address and manage all of these constraints. But does designing for high performance and low power really require a trade-off, or can they co-exist?

Synopsys has a long history of introducing innovative optimizations, especially in the areas of concurrent timing, area, power and test implementation. However, a specific focus on new technologies targeting high-performance, GigaHertz+ (GHz+) designs has resulted in many recent technology advances. Optimization technologies, many of them shared between Design Compiler and IC Compiler, have been added specifically to achieve GHz+ performance. Some recent advancements involve actual placement techniques, but performing special buffering and using a new clock distribution method also help achieve faster performance. Power optimization has also been built into many of the techniques to address the other aspect of the challenge.

Optimized Placement Results in Better Performance, Less Power and Smaller Area
Performing incremental placement optimization and ensuring that physical datapaths are placed in a structured fashion can vastly improve design performance. Synthesis engines should have physical awareness built in so that critical timing paths are not created using wire-load models or global router estimates. With physically aware incremental placement optimization, critical timing paths are placed close to the source during synthesis, which results in higher performance. Local timing-driven re-placement of cells along critical paths can be performed without creating high-density paths.

For physical datapath support, using relative placement for high-speed datapaths can result in much smaller area, consuming much less dynamic power (see Figure 4). Datapaths are commonly used in microprocessors, digital signal processors (DSPs) and graphics processors, which have aggressive performance, power and area targets. Building blocks commonly used in processor designs, such as adders, multipliers, coders and decoders, can be tiled to lay out highly regular structures.

Figure 4: Use of Relative Placement for Datapath Structures Saves Area and Power While Achieving Performance Targets
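The tiling idea behind relative placement can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any implementation tool expresses relative placement: the function names, the one-row-per-bit, one-column-per-stage layout, and the cell dimensions are all hypothetical.

```python
def tile_datapath(num_bits, stages, cell_width, row_height):
    """Place one row per bit slice and one column per datapath stage.

    Returns {(bit, stage_name): (x, y)} giving the lower-left corner of each
    cell relative to the origin of the datapath block.
    """
    placement = {}
    for bit in range(num_bits):
        for col, stage in enumerate(stages):
            placement[(bit, stage)] = (col * cell_width, bit * row_height)
    return placement


def bbox_area(placement, cell_width, row_height):
    """Area of the bounding box that encloses every placed cell."""
    xs = [x for x, _ in placement.values()]
    ys = [y for _, y in placement.values()]
    width = max(xs) + cell_width - min(xs)
    height = max(ys) + row_height - min(ys)
    return width * height


if __name__ == "__main__":
    # Hypothetical 4-bit slice of a mux -> adder -> register pipeline.
    cells = tile_datapath(4, ["mux", "add", "reg"], cell_width=2.0, row_height=1.0)
    print(cells[(3, "reg")], bbox_area(cells, 2.0, 1.0))
```

Because every bit slice has the same shape, the wires between stages stay short and uniform, which is where the area and dynamic power savings come from.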

Intelligent Buffering for Faster Designs
With most high-performance designs, every picosecond counts, and buffering long wire nets to avoid violations can hurt performance. Intelligent handling of scenic, parallel and local buffering helps achieve faster designs. Intelligent buffer placement for sink clusters, when sinks are spread far apart, prevents “scenic” loop buffers. Parallel buffer chains, which arise when sinks belong to completely different logical hierarchies, should also be avoided. Lastly, smarter local buffering minimizes the number of buffer chains, which also improves performance.
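Why the number of buffers matters can be illustrated with a first-order Elmore delay model of a long wire split into equal segments by identical buffers. This is a textbook approximation, not the algorithm used by any particular tool, and every electrical value in `WIRE` is a hypothetical placeholder.

```python
def buffered_wire_delay_ps(n_buffers, length_um, r_ohm_per_um, c_ff_per_um,
                           drv_res_ohm, drv_cap_ff, drv_intrinsic_ps):
    """Elmore delay of a wire cut into n_buffers + 1 equal stages.

    The driver and every buffer share the same output resistance, input
    capacitance and intrinsic delay. Note that ohm * fF gives femtoseconds.
    """
    stages = n_buffers + 1
    seg_um = length_um / stages
    wire_r = r_ohm_per_um * seg_um
    wire_c = c_ff_per_um * seg_um
    stage_fs = (drv_res_ohm * (wire_c + drv_cap_ff)
                + wire_r * (wire_c / 2 + drv_cap_ff))
    return stages * (drv_intrinsic_ps + stage_fs / 1000.0)


def best_buffer_count(max_buffers, **wire):
    """Scan 0..max_buffers and return the count with the lowest delay."""
    return min(range(max_buffers + 1),
               key=lambda n: buffered_wire_delay_ps(n, **wire))


# Hypothetical 2 mm net; none of these values describe a real process.
WIRE = dict(length_um=2000.0, r_ohm_per_um=2.0, c_ff_per_um=0.2,
            drv_res_ohm=500.0, drv_cap_ff=2.0, drv_intrinsic_ps=10.0)

if __name__ == "__main__":
    for n in (0, 4, best_buffer_count(30, **WIRE), 30):
        print(n, round(buffered_wire_delay_ps(n, **WIRE), 1))
```

In this model the delay falls quickly as the first few buffers are added, flattens near the optimum, and then rises again as each extra buffer contributes its own intrinsic delay, which is why unnecessary buffer chains cost performance.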

Not All Metal Layers are the Same
With the push towards smaller process geometries, optimization tools should have layer awareness built in, since layer resistance varies dramatically between metal layers, especially at 45nm and below. Typically, metal layers with better delay characteristics must be manually selected or defined. But with advanced layer-aware optimization, critical timing paths are promoted to upper metal layers and less critical nets are pushed to the lower metal layers. This benefits both timing and buffering overall, which helps eke out more performance.
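A minimal sketch of the promotion step, assuming a fixed budget of upper-layer routing resources and a per-net slack value, could look like the following. The function name, the layer labels and the slack numbers are hypothetical, and a real optimizer would also re-time the design after each promotion.

```python
def assign_layers(net_slacks_ps, upper_budget,
                  upper="upper metal", lower="lower metal"):
    """Promote the most timing-critical nets to the low-resistance layers.

    net_slacks_ps maps net name -> slack in ps (negative means violating).
    Only upper_budget nets fit on the upper layers; the rest stay lower.
    """
    # Most critical (smallest slack) first.
    ranked = sorted(net_slacks_ps, key=net_slacks_ps.get)
    return {net: (upper if rank < upper_budget else lower)
            for rank, net in enumerate(ranked)}


if __name__ == "__main__":
    slacks = {"alu_carry": -50, "dbg_bus": 120, "fwd_mux": -5, "cfg_reg": 30}
    for net, layer in assign_layers(slacks, upper_budget=2).items():
        print(f"{net:10s} -> {layer}")
```

The design choice here is simply to rank by slack; the point is that scarce low-resistance tracks go to the nets that need them, while non-critical nets absorb the higher resistance of the lower layers.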

CTS and Clock Mesh for GHz+ Designs
Efficient generation and use of clock gating and CTS are well-known techniques for saving dynamic power. Useful skew is a popular technique that improves design timing by adjusting launch and capture clock arrival times to borrow margin from positive-slack paths. The purpose of a clock mesh is to reduce clock skew, both in the nominal design and across variations. Clock mesh circuits can consume more routing resources and may consume more power, but they can be used to effectively achieve GHz+ performance. While mesh gives the lowest skew and the best on-chip variation (OCV) immunity, it is also the most power hungry because of the dense clock grid. Careful control of the pre-mesh drive skew minimizes the short circuit current that dominates mesh power usage.
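Useful skew can be illustrated with the usual setup-slack relation, slack = period + capture latency - launch latency - path delay - setup time. The sketch below uses a hypothetical three-register pipeline with made-up delays, ignores hold checks, and simply splits the difference between the two slacks.

```python
def setup_slack(period_ps, launch_latency_ps, capture_latency_ps,
                path_delay_ps, setup_ps=0):
    """Setup slack of one register-to-register path, in picoseconds."""
    return (period_ps + capture_latency_ps - launch_latency_ps
            - path_delay_ps - setup_ps)


def balance_skew(slack_in_ps, slack_out_ps):
    """Clock delay to add at the middle register so both slacks become equal.

    Delaying the middle register's clock adds margin to the incoming path
    and removes the same amount from the outgoing path.
    """
    return (slack_out_ps - slack_in_ps) // 2


if __name__ == "__main__":
    period = 1000                       # 1 GHz clock
    a_to_b, b_to_c = 1100, 700          # hypothetical path delays in ps
    before = (setup_slack(period, 0, 0, a_to_b),
              setup_slack(period, 0, 0, b_to_c))
    skew = balance_skew(*before)
    after = (setup_slack(period, 0, skew, a_to_b),
             setup_slack(period, skew, 0, b_to_c))
    print("before:", before, "skew:", skew, "after:", after)
```

The failing first path recovers by borrowing margin from the second path, which had slack to spare, without changing the clock period.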

Synopsys recently introduced a new clock distribution method called multisource CTS, which combines the best of clock mesh and CTS. It offers lower power than mesh circuits and lower OCV latency, and it is well suited for hierarchical mesh applications. Multisource CTS creates a mesh grid that can be 10X less dense than a clock mesh and provides more clock gating depth (see Figure 5).

Figure 5: Synopsys Multisource CTS Combines the Best of Clock Mesh and CTS

Final-Stage Leakage Recovery Preserves Performance while Saving Power
Channel length variants of libraries are now supplied by library vendors for 40nm and below, exponentially increasing the number of actual cell variants available for libraries. Library vendors, like Synopsys, are creating variations of each cell with different channel lengths. Generally, High-Vt (HVt) libraries are better for power and worse for timing, while Low-Vt (LVt) libraries are much better for timing, but are very leaky. With the availability of libraries containing multiple channel lengths, it is possible to achieve better timing and lower leakage with a Standard-Vt (SVt) cell with a longer channel than an HVt cell with a standard channel length. For the 28-nm HPM process, a shorter length SVt cell would provide 17 percent lower performance and 30 percent lower leakage than a standard-length LVt cell, making it more compelling to use while also saving an extra mask layer. The question is: how many of these cell variants should be supplied to the optimization engine? The best methodology is to select a few variants for synthesis, depending on the constraints, but have the back-end implementation perform a final-stage leakage recovery with many if not all of the variants included. This will preserve critical performance while saving leakage power everywhere else in the design.
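The final-stage recovery step can be pictured as a greedy swap loop: try the least leaky variant of each cell and keep the swap only if every path through the design still meets timing. The sketch below is a simplified illustration, not the method used by any tool. The `Variant` class, the variant names, and all delay and leakage figures are hypothetical, and path delay is modeled as a plain sum of cell delays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    delay_ps: int
    leakage_nw: int


def paths_meet_timing(choice, paths):
    """True if every path's summed cell delay is within its required time."""
    return all(sum(choice[c].delay_ps for c in cells) <= required
               for cells, required in paths)


def recover_leakage(choice, variants, paths):
    """Greedily swap each cell to the least leaky variant that keeps timing.

    choice maps cell name -> current Variant and must already meet timing.
    """
    choice = dict(choice)
    by_leakage = sorted(variants, key=lambda v: v.leakage_nw)
    for cell in sorted(choice):                 # deterministic visiting order
        for cand in by_leakage:
            if cand.leakage_nw >= choice[cell].leakage_nw:
                break                           # no leakier swap is useful
            trial = {**choice, cell: cand}
            if paths_meet_timing(trial, paths):
                choice = trial
                break
    return choice


if __name__ == "__main__":
    lvt = Variant("LVt", 10, 100)
    svt = Variant("SVt", 12, 40)
    hvt = Variant("HVt", 15, 10)
    start = {"u1": lvt, "u2": lvt, "u3": lvt}
    paths = [(["u1", "u2"], 22), (["u3"], 20)]  # (cells on path, required ps)
    final = recover_leakage(start, [lvt, svt, hvt], paths)
    print({cell: v.name for cell, v in final.items()})
```

Cells on the critical path keep their fast variants while cells with spare slack move to slower, less leaky ones, which is the behavior the paragraph above describes.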

Conclusion
Other power optimization techniques such as the use of power intent to define power domains and shutdown usage, multi-Vt optimization, and multi-corner, multi-mode optimization all work in conjunction with many of the specific high-performance technologies outlined above. In addition, Synopsys has made significant enhancements to ensure that the use of costly LVt cells is minimized with any default optimization.

Lastly, to address time to market, it is best that the synthesis engine (for example, Design Compiler) has physical awareness and guidance built in to minimize iterations between the front- and back-end implementation for faster design closure. Physical guidance in the synthesis engine enables adjustment of the design’s floorplan to remove congestion hotspots or other physical issues directly in the RTL synthesis environment, providing a tighter correlation and better starting point for back-end products, like IC Compiler.

The answer to achieving all of the design constraints is really a combination of methodology and optimization technology. In other words, deploying a design methodology that takes advantage of the latest in EDA innovations can result in achieving all design goals, or Vikas Agrawal’s Quadrangle of Constraints. In the past, power had to be sacrificed in order to meet higher performance, but with recent optimization innovations and low power methodologies, it is possible for both to co-exist.

 

Related Video:

DAC 2012: Customer Insight Sessions
High Performance, Gigahertz+ Design Success with the Galaxy Implementation Platform
http://www.synopsys.com/Solutions/EndSolutions/GalaxyImplementation/Pages/dac-2012-customer-insight-sessions.aspx

Mary Ann White is a product marketing director for the Galaxy Implementation Platform at Synopsys.  She has more than 25 years of experience working in the EDA and semiconductor industries.  White has a BS EECS degree from UC Berkeley.



©2014 Extension Media. All Rights Reserved.