Posts Tagged ‘power islands’

Verifying Low-Power Designs

Thursday, January 14th, 2010

By Ed Sperling
Power islands and multiple voltages used to be reserved for cell phone and processor companies, but as more companies move to 65nm and 45nm process nodes, these approaches to saving power are becoming mainstream, particularly in chips with multiple cores.

The problem isn’t in the architecture of the chips, although that certainly brings its own set of challenges. More and more, the real holdup is at the verification level. The percentage of time spent in verification has remained relatively steady, at anywhere from 50% to 75% of the total time between architectural design and tapeout, but the size of the verification teams has doubled and in some cases tripled.

“Verification is the next big challenge,” said Naveed Sherwani, CEO of Open Silicon. “As an industry we have not done a good job managing verification. A new methodology would be very welcome. We have had to develop methodologies in-house to deal with this.”

Sizing up the problem
All of the major EDA vendors recognize the extent of the problem. They’ve been dealing with horror stories from the field since the 90nm process node. And according to TSMC, about two-thirds of the industry is now at that node or beyond.

The most advanced parts of the semiconductor industry are now working on 32nm and 28nm, with even more power states—on, off, sleep, and sometimes even more in-between states—more power islands and more processor cores. In the most advanced chips, some of those cores are even heterogeneous, which means they may have different voltages and states than the other cores. That allows a system to reduce power consumption overall and concentrate power where and when it’s most needed.

“When you cross 100nm, you’ve got to design this stuff in or you’re not competitive,” said Barry Pangrle, solutions architect for low-power design and verification at Mentor Graphics. “We’ve got a number of people well down the road on this. Larger companies with larger design teams can afford the engineering expense to make this work. But as more people go to more advanced nodes they’re going to be dealing with issues they never had to deal with before.”

The first thing that most designers encounter is complexity. What used to be done on a spreadsheet is much harder to manage now.

“There are a whole series of interrelated topics of increasing complexity,” said Srikanth Jadcherla, group director for R&D at Synopsys. “The state space is huge, and when you start dealing with three or four power islands it’s amazing how quickly the number of states and sequences explodes.”

It’s also amazing how complicated this stuff can get very quickly. Consider, for example, what happens when you’ve got a device and you’re checking e-mail. The processor wakes up a number of mixed signal blocks, then turns off what’s not being used. But that sequence also has to be ordered, which means you also have to order the power islands.

“You may wire it from low to high when you need to go from high to low,” said Jadcherla. “The problem is that you’re trying to predict island orders. You can create a safe graph, which is a set of possible states so you can look at a design and ask, ‘What are the safe ways this will work?’ But when you’re dealing with 36 to 40 islands, there’s no way you can set it up safely.”
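
The safe-graph idea can be illustrated with a small search. What follows is a minimal sketch, not Synopsys’ method: it assumes a state is simply the set of islands that are on, that islands switch one at a time, and that a user-supplied rule decides which states are safe. The function name, the island names and the rule are all hypothetical.

```python
from collections import deque


def find_safe_sequence(islands, is_safe, start, goal):
    """Search for a safe power sequence, switching one island at a time.

    A state is the frozenset of islands that are currently ON.
    Returns the list of states from start to goal, or None if every
    route passes through a state that is_safe rejects.
    """
    start, goal = frozenset(start), frozenset(goal)
    parent = {start: None}          # doubles as the visited set
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for island in islands:
            nxt = state ^ {island}  # toggle exactly one island
            if nxt not in parent and is_safe(nxt):
                parent[nxt] = state
                queue.append(nxt)
    return None


# Hypothetical rule: the radio island may only be on while the CPU island is on.
def radio_needs_cpu(state):
    return "radio" not in state or "cpu" in state


sequence = find_safe_sequence(["cpu", "radio"], radio_needs_cpu,
                              start=[], goal=["cpu", "radio"])
print([sorted(s) for s in sequence])  # [[], ['cpu'], ['cpu', 'radio']]
```

With two islands the search is trivial. Because the number of states doubles with every added island, an exhaustive search like this one stops being practical long before 36 to 40 islands, which is consistent with the difficulty Jadcherla describes.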

Tales from the crypt
One of the most common mistakes that design teams make in chip engineering involves internal organization and communication. The way a team is organized and communicates has to reflect what’s going on in the chip design and verification.

“We’ve seen problems in a library group, for example, where they save power in a certain way that’s different from other groups,” said Mike Carroll, product marketing manager for front-end design at Cadence. “Communications between teams is not always the tightest loop. If one group instantiates it the wrong way, you may have power shutoff without state retention.”

In a library, that can be disastrous for a system, or at least for some of the system’s functions.

It’s also a big problem in flash. Consider, for example, a smart phone where the low-battery signal is flashing and the system is ready to shut down to keep enough charge in the device to maintain essential data in memory.

“If you get a phone call at that time and you pick it up, it can be disastrous for the system,” said Synopsys’ Jadcherla. “But how do you prove that? It’s not easy. You need to come up with a methodology to test it. That’s where random constraints and testing come in.”
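
A rough idea of what constrained-random testing looks like for this scenario is sketched below. Everything in it is an assumption made for illustration: the toy controller, its event names, its call-handling policy and the invariant are hypothetical and are not taken from any real phone or from Synopsys’ tools. The constraint is the weighting, which steers the random generator toward the low-battery and incoming-call corner.

```python
import random

EVENTS = ["low_battery", "charger_in", "incoming_call", "hang_up"]
# Constraint: bias the generator toward the low-battery / incoming-call corner.
WEIGHTS = [4, 1, 4, 2]


class ToyPhonePower:
    """Hypothetical power controller; not modeled on any real device."""

    def __init__(self):
        self.battery_low = False
        self.in_call = False
        self.radio_on = True
        self.retention_on = True  # memory retention must never drop

    def handle(self, event):
        if event == "low_battery":
            self.battery_low = True
            if not self.in_call:
                self.radio_on = False
        elif event == "charger_in":
            self.battery_low = False
            self.radio_on = True
        elif event == "incoming_call":
            # Assumed policy: calls are accepted only with enough charge.
            if self.radio_on and not self.battery_low:
                self.in_call = True
        elif event == "hang_up":
            self.in_call = False
            if self.battery_low:
                self.radio_on = False

    def invariant_holds(self):
        # With a low battery the radio may stay on only during a call.
        radio_ok = self.in_call or not (self.battery_low and self.radio_on)
        return self.retention_on and radio_ok


def run_random_sequences(n_sequences=200, length=30, seed=0, make_dut=ToyPhonePower):
    """Return the first failing event sequence, or None if all pass."""
    rng = random.Random(seed)  # seeded so any failure can be replayed
    for _ in range(n_sequences):
        dut = make_dut()
        events = rng.choices(EVENTS, weights=WEIGHTS, k=length)
        for step, event in enumerate(events, start=1):
            dut.handle(event)
            if not dut.invariant_holds():
                return events[:step]
    return None


print(run_random_sequences())
```

Returning the failing prefix of events, rather than a pass/fail flag, gives the engineer a sequence that can be replayed directly when a violation turns up.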

Another problem arises when engineers route signals across other blocks or power domains. Pangrle noted that this may not show up in the block diagram, particularly if the block is powered down.

“The key is to keep the logical hierarchy matching the physical hierarchy,” he said. “But design teams are not experienced with that. Another problem is that the signal may not be the same on one side as the other.”

That can also happen at advanced process nodes with process variations—an issue that no one even paid attention to at 130nm. At 45nm, it can be the difference between a functioning chip and a buggy one.

Advice from the experts
Low-power experts have consistently advised design teams to think about low power at the architectural level, and nothing has changed in that regard. What has changed is the number of possibilities for verification. Adam Sherer, product marketing manager at Cadence, said that the number of possible states is two raised to the power of the number of power domains. So if there are two domains, there are four possible states; three domains give eight, and so on.
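
The exponential growth Sherer describes is easy to see by enumeration. This is a minimal sketch that assumes each domain is simply on or off; the domain names are made up, and real designs with sleep or retention states would grow even faster.

```python
from itertools import product


def power_states(domains):
    """Yield every on/off combination for the given power domains."""
    for bits in product((False, True), repeat=len(domains)):
        yield dict(zip(domains, bits))


def count_states(n_domains):
    """Number of on/off combinations for n_domains independent domains."""
    return 2 ** n_domains


# Two hypothetical domains give four states.
for state in power_states(["cpu", "modem"]):
    print(state)

# The count grows quickly as domains are added.
for n in (2, 4, 10, 40):
    print(n, "domains ->", count_states(n), "states")
```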

“Verification does not have a theoretical limit, but pragmatically there are limitations,” Sherer said. “The problem is coverage. If you can manage to create a loop, you can extend it to the power domains. We’re seeing the same from the functional teams. Randomization testing is where the functional coverage comes in. As long as there is coverage and you can see functional sequences you have vision into the power domain space. It has to be able to come out of shutdown and on the implementation side it has to work.”

That means establishing power intent so you shut off something at a particular time.

All the EDA companies say that a verification methodology helps as well, although each favors its own flavor, whether it’s OVM or VMM. Higher-level abstraction standards such as CPF, UPF and TLM 2.0 also help significantly.

“With TLM you can figure out what’s in hardware and what’s in software and which blocks run at which voltage,” said Pangrle. “Then you can put in which blocks to shut down entirely and specify the power states.”

And if you can create an effective coverage model based upon those factors, then at least you have a chance of getting a chip out the door on time, possibly within budget, and one that actually works.
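
As one illustration of what a power-aware coverage model might track, the sketch below records which power states and which state-to-state transitions a test run has exercised. It is an assumption-laden toy, not a feature of OVM, VMM, CPF or UPF: the class name and the on/off-only view of each domain are hypothetical.

```python
from itertools import product


class PowerStateCoverage:
    """Track which on/off power states and transitions tests have hit."""

    def __init__(self, domains):
        self.domains = tuple(domains)
        self.all_states = set(product((0, 1), repeat=len(self.domains)))
        self.seen_states = set()
        self.seen_transitions = set()
        self._last = None

    def sample(self, state):
        """Record one observed state, given as {domain_name: on_or_off}."""
        key = tuple(int(bool(state[d])) for d in self.domains)
        self.seen_states.add(key)
        if self._last is not None and self._last != key:
            self.seen_transitions.add((self._last, key))
        self._last = key

    def state_coverage(self):
        """Fraction of all possible states observed so far."""
        return len(self.seen_states) / len(self.all_states)

    def missing_states(self):
        """States no test has reached yet, in sorted order."""
        return sorted(self.all_states - self.seen_states)


cov = PowerStateCoverage(["cpu", "radio"])
cov.sample({"cpu": 1, "radio": 0})
cov.sample({"cpu": 1, "radio": 1})
print(cov.state_coverage())   # 0.5
print(cov.missing_states())   # [(0, 0), (0, 1)]
```

The list of missing states is the useful output: it tells the team which power configurations still need a directed or random test.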

Experts At The Table: Rising Complexity Meets Verification

Thursday, December 10th, 2009

By Ed Sperling
Low-Power Engineering sat down to discuss rising complexity and its effects on verification with Barry Pangrle, solutions architect for low power design and verification at Mentor Graphics; Tom Borgstrom, director of solutions marketing at Synopsys; Lauro Rizzatti, vice president of worldwide marketing at EVE, and Prakash Narain, president and CEO of Real Intent. What follows are excerpts of that conversation.

LPE: When do you verify closer to the metal and when do you move to a higher level of abstraction?
Pangrle: The higher level of abstraction is the wave of the future. It’s where things are headed.
Borgstrom: It’s been the wave of the future for the past 10 years, though.
Pangrle: But it is different this time. We’re seeing traction with customers and the blocks they’re using it with, and we’re hearing from them, ‘You know, we couldn’t have done it at this time if we didn’t have this tool.’ That’s different. There’s an efficiency that’s only going to ramp from here. But at the same time, we’re modeling in C. If the architects have the same hole with what they’re using and the same way they’re testing it, and that’s what you’re going to measure your RTL against, how are you going to catch a problem? The hole is in the model.
Narain: No, because when the verification engineer compares a mismatch between the RTL and a reference C model, then you can find errors in the reference model.
Pangrle: You can always find errors later in the process. People still check at the gate level and find errors in RTL.
Rizzatti: Exactly. Nvidia is still emulating at the gate level with hundreds of millions of gates.
Pangrle: Doing this higher-level stuff doesn’t preclude doing things at the gate level. But it does give you a higher level of confidence that if I’m taking C and using that to generate RTL, then hopefully I won’t have as many errors.

LPE: When do you know when you need to look at the higher level or at the RTL level?
Borgstrom: It’s really a continuum. Every project starts out as a concept, so you do some high-level modeling and algorithm development. Then it follows the classic ‘V diagram’ development where you go from that high level and make all the components. When you get down to that IP or block level, you’re doing very detailed verification, simulation and formal analysis. Then when you get all those blocks developed you start integrating them and do full chip verification involving simulation and hardware-software co-verification on a hardware-assisted verification platform. You eventually get up to full system integration where you’re prototyping with external interfaces. At each phase of the process you’re using a different set of tools to find a different set of bugs.
Narain: There’s top-level activity and then design starts in a distributed manner. Designers start implementing various pieces of a design based upon a spec, and then it comes together. So the design starts off as blocks, then it goes to clusters, and then to a full chip, and at every point you have a chance to use verification. The big questions are what is the cost, how much investment you’re going to make and what is the return. Typically block-level verification is compromised because there are too many testbenches to develop. People tend to do simulation more at the cluster and full-chip level. At every point in time you apply the cheapest technology. A lot of blocks can use formal verification. When you get to clusters, you need more sophisticated techniques. At full-chip level, you start getting into emulation.

LPE: Analog is separate, as well, right?
Narain: Yes, analog is different and you use your own techniques for that.
Borgstrom: There’s a dedicated tool chain for doing custom blocks and custom simulation. The designer and verification engineer are very often the same person. It’s a very close iterative loop. There’s a real need for doing mixed-signal verification once you start integrating the whole SoC with the analog blocks and making sure that analog-digital boundary is behaving correctly, both from a power perspective and a physical connectivity perspective.
Rizzatti: If you have a problem at the higher level of design in C and you don’t catch it, you don’t catch it at the RTL level?
Narain: That’s correct.
Pangrle: But the reference is the same.
Narain: No, what you’re saying is that you take the C reference model and automatically derive RTL from it. I’m talking about taking a C model and letting the designers independently generate RTL. If you automatically derive RTL you won’t find the errors. If you use a spec and let the designers test against that, then it is conceivable that a bug will be caught.
Pangrle: There is a chance of an RTL to C mismatch. But it’s more likely that will be a problem with how the RTL was implemented. If I have some idea of what I’m looking for—and that’s usually defined by the architect—then I’m more comfortable knowing that what I’ve modeled has been turned directly into RTL. If I hand it over to an engineer to figure out my intent and then he goes off and does his own thing, then there’s even more likelihood of a mismatch. The mismatch will be more likely from the translation.
Narain: If you compromise on independence of fundamentals of the checking, then you’re compromising on the integrity of the verification.
Pangrle: This is like having multiple votes. You should have three independent design teams?
Narain: If you can afford it.
Borgstrom: It’s clear that this high-level design and verification flow is relatively new and controversial. But it does show a lot of promise.

LPE: Let’s swap topics. Software is becoming more complicated and so are power issues involving islands and various modes. What does that do to the verification process?
Rizzatti: It’s a nightmare.
Borgstrom: It’s a lot more complicated. The data we have shows that at about the 65nm node, average design team size is evenly split between hardware and software. As we get into smaller and smaller geometries we’re starting to see software-driven architectures where a lot of the value of the semiconductor product comes from software delivered along with it. When you have a huge software team waiting for software to come out, that’s not very economical. One thing chip companies are trying to do is figure out how to get their software teams started sooner. One way you do that is to come up with a virtual platform. You come up with a SystemC, TLM-level model of the overarching design even a year before you have silicon and get people writing software against that. As the RTL becomes more mature, you can put that into a hardware-assisted platform like an FPGA rapid prototype and continue your software development running at 10MHz to 30MHz prior to silicon commitment.
Pangrle: We’re seeing similar things in the market. There’s a shift toward software. Anything you can do to help teams get started earlier on software helps close that whole window down in terms of the amount of time they need to get a whole system up and running. We feel that standards help speed up this whole process. Using TLM 2.0 helps, and it is all part of the Open SystemC Initiative. You also really want to know what the intent is of the design and how you’re going to partition it up, because that has an effect on how you start verifying it. Being able to determine which blocks you want to run at which voltage levels and which ones you’re going to use to create voltage islands is information that gets passed down for verification. If I run everything at a single voltage, then I don’t have to worry about these kinds of issues. But if it makes economic sense to run a block at lower power, then I have all these other things I have to check.
Narain: Control over the design implementation process is a big problem. There is a strong requirement for software to eliminate errors in the implementation.
Rizzatti: The crossing point between hardware and software, according to Handel Jones, was 130nm. The other thing is that you start with virtual prototyping and after that you move to FPGA prototyping. If that were the flow, it would kill emulation. I don’t see that happening and that is not what the large chip makers are doing. There is a very clear moment in the flow where emulation is unique, which is the integration between hardware and software because the FPGA prototyping will not give you any ability to trace bugs. And we see this more and more.

Defining Reliability In Low-Power Designs

Thursday, October 15th, 2009

By Ann Steffora Mutschler
Having a clear understanding of what reliability means for a particular low-power application can make a significant difference when it comes to communicating with engineering team members and customers. Is reliability simply a question of how long a device can run without errors? And what happens to reliability when power modeling, verification and other design techniques are utilized?

As Massimo Sivilotti, chief scientist at Tanner EDA, pointed out, “These questions are complex, and there is no universally accepted answer to any of them.”

In general though, low-power designs involve both architectural and circuit design components, along with issues such as sub-threshold leakage currents and upsets due to substrate- and power-supply-coupled noise. Device parameter variations due to statistical process factors for deep-submicron devices become more acute as power levels fall. As such, state-of-the-art device models, up-to-date model parameters from foundries, and data-driven noise calculations become essential.

From Intel Corp.’s perspective reliability is more an attribute of the nature (or use model) of an application – whether it is low power or not. “For example, a low power smart phone application would define ‘reliability,’ both from device and user perspective, very differently than an equally low-power battery-powered medical device that administers medicines to critically ill patients,” said Pranav Mehta, chief technologist for Intel’s Embedded Communications Group. “Having said that, low-power designs do offer special challenges to designers. Balancing the need to lower the operating voltage to reduce power while trying to achieve competitive performance provides significant challenges in terms of process technology recipe, architectural tradeoffs, as well as design tool chain and methodology selections.”

The core of the problem
Diving down, technically speaking, Srikanth Jadcherla, group director of R&D for Synopsys Inc.’s Verification Group, noted that reliability in low-power design goes back to the fundamentals – avoiding permanent or temporary dysfunction of the device due to physical effects such as electromigration, self heating and rail/signal integrity failures. While these might have been overlooked before, the causes of the failures or in some cases the magnitude of certain phenomena can no longer be ignored.

“Some of these cause IC designers to adopt a certain power mitigation (or current mitigation) technique,” Jadcherla said. “Some of these are caused by what is done for power reduction. So, it cuts both ways. Specifically, as the industry heads into nanometer designs, current magnitudes are rising while wire cross sections are shrinking – increasing current density dramatically. This puts a lot more stress on the wires from an electromigration point of view and also from a heating standpoint. Ditto for leakage, which increases the average amount of current flowing through the wires irrespective of activity. This issue didn’t exist before. To combat these issues, IC designers have adopted aggressive techniques such as power gating and voltage scaling to opportunistically reduce the current draw.”

Docea Power, based in Moirans, France, looks at reliability in low-power design from the system perspective. CEO and co-founder Ghislain Kaiser said high power consumption affects reliability of electronic systems due to thermal dissipation and electrical issues induced by high-density currents.

There are multiple reliability issues related to high temperature, including physical stress on the package, especially on die-attach material; transistor and interconnect deterioration; alteration of transistor switching time, and hence timing hazards; thermal runaway risk when leakage current becomes significant; and the need for cooling systems such as a fan, which adds a reliability risk of its own if the cooling system fails.

But, Kaiser noted, high-density currents alter electrical properties by causing such issues as electromigration of metal atoms along conductors; crosstalk, which degrades signal integrity; or a voltage drop along resistive wires. “This last point is particularly important when a low-power approach like voltage scaling is used. Lowering voltage allows you to reduce power consumption, but it increases the risk of going below the working point of transistors. The design work involves correctly sizing the voltage margin regarding the use cases,” he said.
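
The interaction between voltage scaling and resistive drop can be shown with Ohm’s law. This is a simplified sketch that treats the supply path as a single lumped resistance, and all of the numbers (current, resistance and the minimum working voltage) are assumed values for illustration, not figures from Docea Power.

```python
def voltage_at_load(v_supply, current_a, wire_resistance_ohm):
    """Ohm's law: the supply rail sags by I * R before it reaches the load."""
    return v_supply - current_a * wire_resistance_ohm


def voltage_margin(v_supply, current_a, wire_resistance_ohm, v_min):
    """Positive margin means the load still sees at least v_min."""
    return voltage_at_load(v_supply, current_a, wire_resistance_ohm) - v_min


# Illustrative numbers only: 0.5 A through 0.1 ohm of rail resistance,
# with transistors assumed to need at least 0.90 V to work.
for v_supply in (1.00, 0.90):
    margin = voltage_margin(v_supply, 0.5, 0.1, v_min=0.90)
    print(f"supply {v_supply:.2f} V -> margin {margin:+.3f} V")
```

In this example the same 50 mV drop that is harmless at a 1.00 V supply pushes the load below its assumed working point once the supply is scaled to 0.90 V, which is why the margin has to be sized per use case.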

Jameel Hussein, technical marketing manager for Xilinx Inc.’s Power and Configuration Solutions, reiterated that consideration must be given to thermal management at both the component and system levels to ensure that all devices are operating within their specified temperature range and to maximize overall system reliability.

“The device’s operating (junction) temperature is a function of the device power, its ability to transfer the resultant heat to the surrounding environment via the component packaging, and the ambient temperature of the system,” Hussein said. “Reducing the device power consumption, therefore, has two significant benefits. First, it lowers the system cost by enabling the use of less expensive thermal solutions to keep the device in its intended operating range. Second, reduced power means lower operating temperatures, which directly translates into improved component and system reliability.”
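
The relationship Hussein describes is commonly approximated with a single junction-to-ambient thermal resistance. The sketch below uses that standard first-order model; the ambient temperature, thermal resistance and power figures are assumed for illustration and do not describe any Xilinx device.

```python
def junction_temperature(ambient_c, power_w, theta_ja_c_per_w):
    """First-order thermal model: Tj = Ta + P * theta_JA.

    theta_ja_c_per_w is the junction-to-ambient thermal resistance,
    which captures how well the package and cooling remove heat.
    """
    return ambient_c + power_w * theta_ja_c_per_w


# Assumed values: 50 C ambient and 10 C/W for the package and cooling.
for power_w in (3.0, 2.0):
    tj = junction_temperature(50.0, power_w, 10.0)
    print(f"{power_w:.1f} W -> junction at {tj:.1f} C")
```

Under these assumptions, trimming one watt lowers the junction temperature by 10 degrees without any change to the cooling solution.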

Added Hussein: “The temperature is a function of the power, so if you can lower the power, you can lower the temperature of the actual device and its surrounding parts. Equation 2 is based on the acceleration factors between the two different devices in this example. If there is a difference of 10 degrees in junction temperature, the equation shows that a device that runs 10 degrees cooler will last twice as long as one running 10 degrees hotter.”
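
The equation Hussein refers to is not reproduced here. A common way to estimate temperature acceleration is the Arrhenius model, and the sketch below assumes that model; whether it matches the referenced equation is a guess. The activation energy of 0.7 eV and the 85 and 95 degree operating points are also assumptions, and the result depends strongly on both.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV per kelvin


def acceleration_factor(t_cool_c, t_hot_c, activation_energy_ev=0.7):
    """Arrhenius acceleration factor between two junction temperatures.

    Returns how many times faster the hotter device is expected to wear
    out than the cooler one, under the assumed activation energy.
    """
    t_cool_k = t_cool_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_cool_k - 1.0 / t_hot_k)
    return math.exp(exponent)


# Assumed example: one device at 85 C, another at 95 C.
print(round(acceleration_factor(85.0, 95.0), 2))
```

With these assumed inputs the factor comes out a little under two, which is in line with the rule of thumb that a 10-degree reduction roughly doubles lifetime.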

Actel, which has been the low-power leader in the FPGA space, has focused part of its reliability argument around on-chip memory. Unlike other FPGAs, Actel’s use flash memory, which is less susceptible to single-event upsets caused by either terrestrial or cosmic radiation. And while that’s of obvious importance in aerospace applications, it’s also considered important in critical functions such as automobile powertrains because upsets often affect multiple bits at increased densities. That may be enough to shut down a chip permanently.

There are workarounds in circuitry and software for these kinds of problems, but they add more area to the circuitry and raise the overall power consumption to make sure there are no problems.

New techniques impact low-power design
Designs today use techniques such as power modeling and complete-coverage verification, and each has pros and cons in its impact on the design.

“Power modeling and advanced verification techniques have definitely improved the ability to hit the projected performance/power curve for a specific design. However, at the end of the day, it still comes down to understanding the target application usage model and using the modeling techniques to tune the design appropriately. Without it, one may still come up with an impressive looking data sheet that really doesn’t cut muster in real application,” said Intel’s Mehta.

In addition, Synopsys’ Jadcherla explained, some of the techniques adopted such as power gating and voltage scaling themselves cause new problems. “First, IC designers really need to now analyze each physical region (island) by itself independently, unlike the entirety of the chip. And they need to do this across all the temporal situations (aka states and transitions) that are likely to occur. Second, the very act of moving voltages adds new irritants into the integrity of rails and signals – the collapse of either can cause temporary failures or permanent device breakdown.”

Another consideration of using advanced techniques is that the architecture team has to model and evaluate the benefits of various low power techniques regarding the use cases targeted by the final application. This leads to defining the various voltage and clock domains, Docea’s Kaiser said.

Finally, a new entrant into this drama has been temperature, Jadcherla said. “Cross die variations are exacerbated by low power designs. Perhaps one part of the chip is mostly off (cool) and another is mostly on (hot). There is very little data on die-level effects, though my suspicion is that field failures haven’t been studied enough. People just can’t wait to get rid of their older model consumer device. At the system level, however, temperature or rather failure to manage temperature of SoCs has caused enough embarrassing failures: devices exploding, devices locking up, thermal runaway, and laptops hot enough to boil water.”

Experts At The Table: Building A Better Mousetrap

Friday, September 4th, 2009

By Ed Sperling
Low-Power Design sat down with Richard Zarr, chief technologist for the PowerWise Brand at National Semiconductor; Jon McDonald, technical marketing engineer in Mentor Graphics’ design creation business unit; Prasad Subramaniam, vice president of design technology at eSilicon; Steve Carlson, vice president of marketing at Cadence Design Systems, and David Allen, product director for power at Atrenta. What follows are excerpts of that conversation.

LPD: How important is it to be green?
Zarr: In the past, when our customers plugged something into the wall they didn’t care. They pushed the problem off. But with some of the legislation, people are starting to care. No system is ever loaded 100% all the time. Even data centers are not always busy. Typically 50% to 80% of the power is wasted. They’re running at high speed and consuming power when they don’t need to be. But they’re not doing anything about it because it’s adding complexity or it’s adding cost. You’re designing the hardware, but someone is taking that and using it in ways that you didn’t design it for.
Carlson: I wrote a paper on the effects of virtualization. One of the things they would do in data centers is offload the servers, but the servers would have to go into standby mode when they’re not being used. They didn’t stand-by very well because they were never designed to stand-by. An improvement in the architecture at the macro level would be a big benefit, but people aren’t doing that unless they’re forced to do it or unless it becomes a competitive advantage.
McDonald: Where people have been investing the time—in the handhelds and at the micro level and device-level optimization—we’ve squeezed a lot of benefit out of that. Things can be made better, but a lot has already been done. At the macro level, almost nothing has been done.
Allen: The great thing about the handhelds is they’re proof points that it can be done. There’s a lot of work going on in the networking companies now, but you’ve got to start at the IC level. Once you’ve got the infrastructure there, then you can start layering on energy efficiency in the lighting, the HVAC in the data center and controlling of peak power.
McDonald: Cisco did a study in 2006 where they determined that if they saved 1% on the power for a network router it was the equivalent of taking tens of thousands of cars off the street. But when you’re designing it, no one cares. They just want to get it out the door and meet performance.
Zarr: Education is a big thing here. Designers are not educated in the vehicles to reduce the power consumption in their designs. It hasn’t been a priority for them.
McDonald: It’s also the delayed benefit. It’s not a benefit to the designer or even the company making the chip.

LPD: If it came down to hitting a deadline for getting a design out the door or cutting power, what’s the likely response?
Carlson: In the case of a very large printer company, it’s getting the chip out the door—even if it ultimately costs more money.
McDonald: Power is not really what most people care about up front. You care about the economics. You care about power only insofar as it affects the economics.
Zarr: It may have more of an impact as we go forward.

LPD: What happens if we trim the margin in designs? Do we gain power savings?
Carlson: There’s incredible waste. If you look at design methodologies for front-end design teams, there was a 5% margin. Now it’s typical to see 20% to 25% margin. One company we’re working with is going to use a 50% timing margin on the design for a battery-operated application. You start to explain what the impact will be on the overall logic architecture and the response you get is, ‘I hadn’t thought of that.’ You need to look at timing and power together. That’s where the real increases in margin occur.
Subramaniam: Margin is an issue, but it’s even more than that. Today we’re overdesigning chips because we are designing for the worst-case scenario that may never occur. So how do we take advantage of the process itself? You need to monitor the chip and lower your voltage accordingly. You’ve designed the chip for the slow corner, but you know that in normal conditions the chip is going to work much faster.
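
The payoff from lowering voltage when the silicon turns out faster than the slow corner can be estimated with the usual CMOS switching-power approximation, in which dynamic power scales with the square of the supply voltage. This sketch ignores leakage and assumes frequency is held constant; the voltages are example values, not figures from eSilicon.

```python
def dynamic_power(c_eff_farads, voltage_v, frequency_hz, activity=1.0):
    """Approximate switching power: P = activity * C * V^2 * f."""
    return activity * c_eff_farads * voltage_v ** 2 * frequency_hz


def savings_from_voltage_scaling(v_nominal, v_scaled):
    """Fraction of dynamic power saved at the same frequency and activity."""
    return 1.0 - (v_scaled / v_nominal) ** 2


# Assumed example: a part designed for 1.2 V that is found to work at 1.0 V.
saved = savings_from_voltage_scaling(1.2, 1.0)
print(f"dynamic power saved: {saved:.1%}")
```

Because of the squared term, a voltage reduction of about 17% in this example cuts dynamic power by roughly 30%.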

LPD: We’ve been adding cores and power domains on a regular basis. Now we’ve got a bunch of this stuff. How do you manage all these pieces?
Allen: You need to start at the architectural level. You can’t retrofit designs on the chip. There are a small number of power architects who can do this. They understand what the tradeoffs are, and from an EDA perspective you have to arm them with the right tools.

LPD: How small?
Allen: At ST there might be four. At TI there might be a half dozen. Maybe that’s enough. You don’t need a whole new power architecture for each derivative. You need a power architecture for the first one, and then you may get 30 or 40 derivatives out of that. But can every small company afford to have one of those guys? No. But big companies do have this expertise.
Carlson: There are sources of expertise to bridge the gap.
Allen: With external expertise, there’s a question of how much the design team learns.
Carlson: It depends on how you structure the engagement. If it’s a turnkey operation, they’re not going to learn much. But you can also teach them how to fish.

LPD: Do we ever get to the point where it’s no longer economical to do this stuff?
Subramaniam: You can probably go quite low on voltage for digital logic. We had a customer running digital logic at 600 millivolts. They could afford to do that because the chip runs at a very low frequency. If you’re willing to go with low performance, you can go to very low voltage on digital logic.
Allen: We’re not quite at the end of this road. Another thing to think about is how much charge is in a battery. That’s not really going to change that much. But there is still a lot of potential for architecture at the high end of the spectrum. Those guys can probably learn a lot.
Zarr: Even architectures that scale frequency will find benefit.

LPD: Is there a limit to how far we want to go down the Moore’s Law roadmap, though?
Subramaniam: There is definitely a tradeoff. Only those with high-volume products will be willing to go to the next step.
Zarr: You never know until the next materials come out. They’re just continuing with strained silicon techniques and SOI.
Subramaniam: There are still a lot of designs done in 0.25 micron and 0.18 micron today. TSMC has not retired a single process since its inception. People will be willing to go back to older nodes if it helps them, but it doesn’t really help with power because they consume more power.

LPD: How do more restrictive design rules affect all of this?
Carlson: That will drive a renaissance in architecture. The process guys will quit solving the problem for you, and you have to be more clever about everything. You can’t just say you’re going to use the next-generation LP process and think you won’t have a problem with it.
Allen: There have been a number of times where the design guys said, ‘Leakage is going to kill us,’ and the process guys said, ‘Don’t worry about it.’ Then it scales to the next generation and it’s something else. The process guys may save us, but they won’t be able to save us forever.
Zarr: Somewhere along the line we’ll have to change materials, whether it’s carbon or something else. Everyone’s trying to avoid making that kind of investment.