Published in the Fall 2012 issue of Chip Design Magazine
Optimizing SoCs for Lowest Power with Innovative Logic Libraries
Mobile communications, multimedia and consumer systems-on-chip (SoCs) must deliver the highest performance while consuming the least energy, both to extend battery life and to fit into lower-cost packaging. Often, though, power is the hard constraint, and the challenge is getting the best possible performance within the available power budget. Each new silicon process generation brings a new set of challenges for providers of logic library and memory compiler IP, along with new opportunities to create more power-efficient IP that lets SoC designers deliver the last megahertz of performance while squeezing out the last nanowatt of power and the last square micron of area. To stay ahead of their competitors, SoC designers must first be aware of the advances in IP and then know how to apply them to all parts of their chips using the latest EDA flows and tools.
This article provides guidance to SoC designers and power architects on achieving optimal tradeoffs in SoC watts per gigahertz by combining innovative power management techniques with libraries that offer multiple threshold voltages (VTs) and channel lengths. It discusses how the inherent performance vs. power tradeoffs of silicon processes from 65 nanometers (nm) through 40nm, 28nm, 20nm and beyond can enable power-optimized logic libraries.
Getting the Power Architecture Right
Given an aggressive SoC functional specification with conflicting performance, power and area targets, the first step in determining SoC power strategy is to create a power budget: a spreadsheet that lists the dynamic and leakage power of each functional block of the SoC along with its performance, voltage, area and other key attributes. This plan should be calibrated using trial logic synthesis runs and memory instance power, performance and area (PPA) reports in the target process to provide a baseline budget for the new design. It is important to explore a broad range of “what-if” scenarios, drawing on a wide variety of multi-VT and multi-channel options for logic and on different configurations for memory instances, in order to zero in on the best starting points. Comparing functions implemented in previous process nodes with the same functions in the new process is also valuable in calibrating an SoC power budget.
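The power budget described above can be prototyped in a few lines before it grows into a full spreadsheet. The following Python sketch is illustrative only: the block names, power figures and the 800 mW budget are placeholder assumptions, not data from any real design or library.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One row of a hypothetical SoC power budget."""
    name: str
    dynamic_mw: float   # estimated dynamic power, milliwatts
    leakage_mw: float   # estimated leakage power, milliwatts
    freq_mhz: float     # target operating frequency
    voltage_v: float    # supply voltage
    area_mm2: float     # estimated area

    @property
    def total_mw(self) -> float:
        return self.dynamic_mw + self.leakage_mw


def summarize(blocks, budget_mw):
    """Total the budget and rank blocks by their share of total power."""
    total = sum(b.total_mw for b in blocks)
    rows = sorted(
        ((b.name, b.total_mw, b.total_mw / total) for b in blocks),
        key=lambda row: row[1],
        reverse=True,
    )
    return {"total_mw": total, "headroom_mw": budget_mw - total, "rows": rows}


# Placeholder numbers for illustration only.
blocks = [
    Block("cpu", 300.0, 40.0, 1000.0, 0.9, 2.0),
    Block("gpu", 250.0, 30.0, 500.0, 0.9, 3.0),
    Block("modem", 100.0, 10.0, 300.0, 0.9, 1.5),
]
report = summarize(blocks, budget_mw=800.0)
```

Each “what-if” scenario (a different VT mix, channel length or memory configuration) then becomes a different list of blocks, and the ranked rows show where the next optimization effort is likely to pay off.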
Figure 1: This “Moore’s Law Report Card” graph plots the normalized area of a 32-bit processor core on the vertical axis as it grows with increasing operating frequency plotted on the horizontal axis. It shows how blocks designed with single VT libraries just get larger instead of faster when the library has reached its performance limits and also how block area changes using libraries from different process variants (low power, general purpose) and different process generations (40nm, 28nm).
With this level of calibrated granularity, circuit and design architectural decisions can be made for functionality, performance, power and area. To perform this analysis, it is important to have a power optimization toolbox and to understand how these libraries take advantage of the capabilities of the available process options.
Power Optimization Toolbox
A power optimization toolbox consists of all the logic cell functions needed to implement the power optimization techniques for the SoC. These techniques include clock gating, shutdown, deep sleep, multiple voltage domains, dynamic voltage and frequency scaling (DVFS), state retention, and voltage biasing. Starting with the adoption of the first 65-nm processes, these capabilities became increasingly supported in EDA flows and are now covered by the IEEE 1801 standard for the Unified Power Format, commonly referred to as UPF 2.0. This “bag of parts” contains all of the circuits needed to perform power optimization functions, along with the annotations that the EDA tools need to validate the design correctly.
Figure 2: Power optimization kits include power gates (switches) to control power to a block, isolation cells to manage signals from powered-down blocks, retention registers (both balloon style and live latch style) to maintain state in powered-down blocks, and level shifters to translate signals between voltage domains.
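To make the roles of these cells concrete, the sketch below decides which special cells a signal needs when it crosses from one power domain to another. It is a simplified illustration, not a substitute for the checks that EDA tools perform from the power intent: the `Domain` class, the domain names and the voltages are hypothetical, and the rule assumes the destination can remain powered while a switchable source is shut down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """Hypothetical power domain description."""
    name: str
    voltage_v: float
    switchable: bool  # True if a power gate can shut the domain down


def crossing_cells(src: Domain, dst: Domain) -> list:
    """Return the special cells a signal driven by src and received by dst needs."""
    cells = []
    if src.name == dst.name:
        return cells  # signal stays inside one domain
    if src.switchable:
        # Outputs of a powered-down block float, so clamp them at the boundary.
        cells.append("isolation")
    if src.voltage_v != dst.voltage_v:
        # Different supply rails need a voltage translation.
        cells.append("level_shifter")
    return cells


# Illustrative domains only.
always_on = Domain("always_on", 0.9, switchable=False)
cpu = Domain("cpu", 1.1, switchable=True)

cpu_to_aon = crossing_cells(cpu, always_on)
aon_to_cpu = crossing_cells(always_on, cpu)
```

Retention registers and power gates are not covered by this rule because they live inside the switchable domain rather than on its boundary.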
Because embedded memories strongly affect SoC functionality, performance, power and area, they must include easy-to-use, built-in low-power circuits for optimal memory power efficiency.
Long-Channel to the Rescue for Leakage, Overdrive and Ultra-Low VT (ULVT) to the Rescue for Performance at 40nm
Each generation of silicon has had increasing leakage issues due to the thinning of the gate oxide and other unintended side effects of transistor construction. At 40nm, drawing gates with a channel length that is 25 percent longer than the minimum provides a dramatic leakage reduction with only a minor reduction in performance. For optimal area efficiency, this requires libraries to be redrawn with the longer channel length. At the same time, overdrive voltages (within silicon reliability limits) can boost the performance of specific blocks that have level shifters at their interfaces. Accurately characterized 40-nm logic libraries at these multiple variants and voltages provide SoC designers and power architects with a broad palette of implementation alternatives that are optimized for the highest performance and for the lowest power.
Figure 3: This 40nm graph plots the relative leakage of a library on the vertical axis (using a logarithmic scale) and the relative performance of a library on the horizontal axis. The graph shows the leakage advantages of the 50nm long channel cells and the performance advantages of low and ultra-low VTs and of using overdrive voltages (with corresponding leakage tradeoffs).
HKMG and Lithography-based Gate Lengths Extend the Power Curve at 28nm
At 28nm, high-k metal gate (HKMG) technology provides process improvements that make it a very attractive node for building high-performance, power-efficient SoCs. PolySiON processes, which use much of the same manufacturing equipment, provide a very cost-effective silicon alternative. Many of these silicon processes support multiple transistor gate lengths at the same gate pitch. This process feature enables multi-channel libraries that are footprint compatible without the area penalty of designing every cell to the worst-case channel length. These swappable libraries facilitate late-stage leakage recovery performed by automatic place-and-route tools and allow very fine granularity in power optimization.
Figure 4: This 28nm graph plots the relative leakage (at the leakage corner) of a library on the vertical axis (using a logarithmic scale) and the relative performance of a library (at the signoff corner) on the horizontal axis. The graph shows the leakage advantages of the mid and max channel cells and the performance advantages of low and ultra-low VTs and of using overdrive voltages (with corresponding leakage tradeoffs).
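The late-stage leakage recovery that footprint-compatible libraries make possible can be pictured as a simple swap: wherever a cell has timing slack to spare, replace it with a slower, lower-leakage variant that occupies the same footprint. The Python sketch below is a deliberately simplified illustration. The variant names, delay penalties and leakage values are hypothetical, and each cell's slack is treated independently, whereas a real place-and-route tool re-times the design after swaps because cells share timing paths.

```python
# Hypothetical footprint-compatible variants:
# (name, added delay in ps relative to the fastest variant, relative leakage)
VARIANTS = [
    ("lvt_min", 0.0, 10.0),
    ("svt_min", 5.0, 3.0),
    ("svt_long", 8.0, 1.5),
    ("hvt_long", 15.0, 0.5),
]


def recover_leakage(slack_ps_by_cell, margin_ps=2.0):
    """Pick the lowest-leakage variant per cell that still fits its slack.

    slack_ps_by_cell maps an instance name to its slack (ps) when built
    with the fastest variant. margin_ps is a safety margin kept in reserve.
    """
    choice = {}
    for inst, slack in slack_ps_by_cell.items():
        best = VARIANTS[0]  # default: keep the fastest variant
        for variant in VARIANTS:
            _, added_delay, leakage = variant
            if added_delay + margin_ps <= slack and leakage < best[2]:
                best = variant
        choice[inst] = best[0]
    return choice


# Illustrative slacks only.
swaps = recover_leakage({"u1": 1.0, "u2": 8.0, "u3": 20.0})
```

Because the variants share a footprint, each swap leaves placement and routing untouched, which is what makes this kind of recovery practical late in the flow.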
Additional VTs (ultra-high VT, ultra-low VT) provide even more granularity, at additional cost. However, with all of these library options, the amount of data presented to the synthesis and place-and-route tools can be overwhelming. The aggressive use of “don’t use” lists (initially hiding both very low and very high drive strength cells) and proper sequencing of libraries provide an efficient methodology for identifying the set of high-speed and high-density logic libraries and memory compilers that will achieve the best performance and power tradeoffs at the minimum cost. These methodologies are effective on many different circuit types, including CPUs, GPUs, high-speed interfaces and other processing applications, and their details depend on the specific circuit configuration and process options being used.
With the aid of an application note or an applications engineer who has a good understanding of the tools (synthesis, place and route) and the flows, one can quickly determine the optimal library combination and sequence for a given configuration of a design. Acquiring specific libraries for each different type and configuration of CPU can be a waste of time and money. A high-performance logic library designed for core hardening can deliver optimal performance if it includes a full selection of efficient circuit functions, the right set of variants and the right drive strength granularity. Once the target performance is achieved, there are multiple strategies that can be employed to optimize power.
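The “don’t use” lists mentioned above amount to filtering the library by drive strength before the first synthesis pass. As a rough sketch, the Python below builds such a list from cell names. The `_X<n>` drive-strength suffix, the cell names and the drive limits are assumptions chosen for illustration; real libraries follow their own naming conventions, and the resulting list would be handed to the synthesis tool through that tool's own mechanism.

```python
import re


def drive_strength(cell_name):
    """Extract the drive strength from a hypothetical '<FUNC>_X<n>' cell name."""
    match = re.search(r"_X(\d+)$", cell_name)
    return int(match.group(1)) if match else None


def dont_use(cells, min_drive=2, max_drive=16):
    """Return cells to hide initially: very low and very high drive strengths."""
    hidden = []
    for cell in cells:
        drive = drive_strength(cell)
        if drive is not None and (drive < min_drive or drive > max_drive):
            hidden.append(cell)
    return hidden


# Illustrative cell names only.
library = ["INV_X1", "INV_X2", "INV_X4", "INV_X16", "INV_X32", "NAND2_X8", "TIEHI"]
hidden = dont_use(library)
usable = [c for c in library if c not in hidden]
```

Widening `min_drive` and `max_drive` on later passes re-exposes the hidden cells once the tools have converged on a reasonable starting point.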
Figure 5: This table shows library selection and sequence recommendations for synthesizing blocks for different levels of target circuit performance with respect to the inherent process capabilities.
The Next Turn in the Power Curve, FinFET Logic Libraries Save the Day
Rather than build multiple processes targeting the high-performance market and the low-power mobile market, foundries populate the 20-nm node with uni-processes: a single silicon process that provides the high-performance market with a very low VT option, the low-power market with a very high VT option, and everything in between. This single process presents a different set of challenges for those who build libraries and for those who use them to build SoCs. These processes provide additional transistor gate lengths and VT options for controlling the performance vs. leakage tradeoff in SoC design. Unfortunately, 20-nm and smaller geometries also carry additional costs and complexities: double patterning technology (DPT) for metal significantly increases mask costs, silicon wafer costs and routing complexity.
Logic libraries at 20nm continue to squeeze the most performance, power savings and area savings out of the available silicon processes, taking full advantage of the latest features in the advanced EDA tools used to implement innovative designs at 20nm and beyond.
At more advanced nodes, FinFETs are replacing planar FETs (also called planar CMOS). FinFETs are estimated to be up to 37 percent faster while using less than 50 percent of the dynamic power. The same technology can cut static leakage current by as much as 90 percent compared to planar technologies.
Figure 6: This 22-nm tri-gate (FinFET) graph plots normalized transistor gate delay on the vertical axis and the operating voltage on the horizontal axis. It shows how these advanced transistors can operate at lower voltages with good performance, reducing active power by more than 50 percent.
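The link between lower operating voltage and lower active power follows from the standard first-order CMOS relationship, in which dynamic power scales with switched capacitance, the square of the supply voltage, and frequency. The short Python sketch below applies that relationship. The voltages are illustrative assumptions rather than values read from the figure, and switched capacitance and activity are assumed to stay constant.

```python
def relative_dynamic_power(v_new, v_old, f_new=1.0, f_old=1.0):
    """Ratio of new to old dynamic power, using P ~ C * V^2 * f.

    Assumes switched capacitance and switching activity are unchanged.
    """
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Illustrative case: supply lowered from 1.0 V to 0.7 V at the same frequency.
ratio = relative_dynamic_power(0.7, 1.0)
savings_percent = (1.0 - ratio) * 100.0
```

In this illustrative case, a 30 percent reduction in supply voltage roughly halves dynamic power, which is why transistors that hold their speed at lower voltages translate so directly into active power savings.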
FinFET logic libraries bring their own set of challenges due to quantization of fins and other silicon effects. Meeting these challenges requires the close collaboration of a multi-disciplinary team. Look for a vendor with expertise in the development of TCAD, transistor modeling, device and parasitic extraction, FinFET-specific layout, place-and-route tools, memory compilers and logic libraries.
Each new SoC process generation brings a new set of challenges and a new set of opportunities for providers of logic library and memory compiler IP to deliver the optimal PPA. SoC designers need to be aware of and know how to take advantage of advances in library IP using the latest EDA tools. Synopsys has the experience and infrastructure to deliver effectively architected, efficiently designed, accurately modeled logic libraries and memory compilers, thoroughly integrated into EDA flows, silicon-proven and rapidly delivered through an experienced worldwide support infrastructure.