How Many Power Islands Is Too Many?

Wednesday, May 13th, 2009

By Ed Sperling

Power domains, also known as power islands, have become to design engineers what multiple cores are to processor architects. They can serve a purpose, namely reducing static current leakage and saving battery life. But they also can add so much complexity that they can make it almost impossible to get a new chip out the door.

Just as there has been talk of hundreds of cores, there has been talk of hundreds of power islands. But trying to verify a chip with that number of power islands is beyond human comprehension at this point, and so far there are no tools to make it simpler. As with multicore programming, there may never be, which is why companies like AMD are now considering dedicating different features for one or more cores rather than trying to split applications into myriad parts.

But power islands bring their own set of unique challenges. When you have 20 power islands, for example, each combination has to be tested. If one is on while another is off, that combination has to be tested, along with the cases where both are on, both are off, and both are in sleep mode (or any of the various modes that draw less power). Add another couple dozen power islands and the problem begins approaching epic proportions.
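To make the scale concrete, here is a minimal sketch that counts static power-state combinations. It assumes every island is independent and supports the same three states (on, off, sleep); that state set and the function names are illustrative assumptions, not taken from any verification tool.

```python
from itertools import product

# Hypothetical state set; real designs may define more or fewer modes per island.
STATES = ("on", "off", "sleep")


def count_combinations(num_islands: int, num_states: int = len(STATES)) -> int:
    """Number of static power-state combinations for independent islands."""
    return num_states ** num_islands


def enumerate_combinations(num_islands: int):
    """Yield every combination explicitly; only practical for small counts."""
    return product(STATES, repeat=num_islands)


if __name__ == "__main__":
    # 44 = the article's 20 islands plus "another couple dozen"
    for n in (2, 5, 20, 44):
        print(f"{n} islands -> {count_combinations(n):,} combinations")
```

With two islands the nine combinations can be listed by hand. At 20 islands the count is roughly 3.5 billion, and that is before considering the order in which islands change state.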

Shireesh Verma, a verification expert in Conexant’s Imaging and PC Media Group, said at this point there are definitely practical limits for the number of power islands.

“The maximum I have seen is 28, but typical is less than 20,” Verma said. “But it is not the complexity in the number of domains. It’s the combination of domains and the sequences you have—how many you have at different power states.”

Power islands must be balanced with the number of cores and the ability to verify the design. At least some of these techniques are needed. Bhanu Kapoor, founder of Mimasic, a consultancy in Richardson, Texas, said that clock gating sufficed as a way of controlling dynamic power until 90nm. But he said from 65nm on, every trick is needed.

“I’ve seen 5 to 9 power islands as the most common number,” Kapoor said. “The largest I’ve seen is from Renesas, which had 23. They had an interesting hierarchical power management scheme. But they’re not all independent [power islands].”

He noted that Nvidia is working on chips with up to 500 cores for graphics processing, which is one of the very few highly parallelizable mainstream applications. He said each power island on an Nvidia chip may control 24 cores.

In addition, there are diminishing returns for power islands. While shutting down power to functions clearly can save battery power by limiting the amount of static leakage, waking up and managing power islands consumes power as well, both in the state change itself and in the management of those various states.
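One common way to reason about this trade-off is a break-even estimate: gating an island pays off only if it stays off long enough for the leakage saved to exceed the energy spent shutting down and waking up. The sketch below assumes a simple constant-leakage model, and every number and name in it is hypothetical.

```python
def break_even_time_s(transition_energy_j: float, leakage_saved_w: float) -> float:
    """Minimum off time (seconds) for power gating to save net energy."""
    if leakage_saved_w <= 0:
        raise ValueError("leakage_saved_w must be positive")
    return transition_energy_j / leakage_saved_w


def net_energy_saved_j(idle_time_s: float,
                       transition_energy_j: float,
                       leakage_saved_w: float) -> float:
    """Leakage energy avoided while off, minus the cost of the off/on cycle."""
    return leakage_saved_w * idle_time_s - transition_energy_j


if __name__ == "__main__":
    # Hypothetical island: 2 microjoules per off/on cycle, 1 mW of leakage saved.
    overhead_j, saved_w = 2e-6, 1e-3
    print(f"break-even: {break_even_time_s(overhead_j, saved_w) * 1e3:.1f} ms")
    for idle_s in (1e-3, 10e-3):
        saved = net_energy_saved_j(idle_s, overhead_j, saved_w)
        print(f"idle {idle_s * 1e3:.0f} ms -> net {saved * 1e6:+.1f} uJ")
```

Under these assumed numbers, an idle period shorter than the break-even time costs more energy than it saves, which is one reason more islands do not automatically mean lower power.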

For most design engineers, though, power islands are a relatively new concept. While the largest semiconductor companies have been working with them since 90nm, most of the work has been experimental.

ARM has had power domain test chips since the 130nm node, but most customers never really began thinking about them until the 90nm node. They’re now starting to hit production in high-volume applications such as smart phones, where turning off functions is essential for preserving battery life.

“A lot of times people will settle for two power domains—here’s the CPU and here’s everything else,” said Rob Aitken, R&D fellow at ARM. “We’ve been interested from the question of how many domains per CPU. We have settled on two. It’s only the more recent cores architected with cores in mind.”

ARM has been demonstrating its 1176 processor cores with state retention, but Aitken said there’s a question of whether design engineers will want to keep everything in the same state. He said that with state retention, there are no more than two power domains per CPU.

“It limits architecturally the things the processor ought to do. There’s also a concept that if you can do it in a nice way that’s transparent to the rest of the methodology, then you can have more. If your RAM had a power switch and it didn’t interfere with anyone’s verification or regulators, then you could put switches on it and buy people something. The added complexity of these domains limits what you can do before you throw up your hands in despair. The limits are in the 20s,” he said.

Less Room For Error

Wednesday, May 13th, 2009

By Ed Sperling

Say goodbye to fat design margins in advanced SoCs. The commonly used method of adding extra performance or area into semiconductors to overcome variability in manufacturing processes or timing closure issues has begun to create problems of its own.

While there was plenty of slack available at 90nm, adding margins at 45nm and 32nm disrupts performance or eats into an increasingly tight power budget, or both. And while this may seem like a relatively simple problem-solving exercise, margins are to a design engineer what a safety net is to a high-wire acrobat. They allow engineering teams to get to market on time and on budget, with an incredibly small number of bugs considering the complexity of current designs.

Cutting margins means substantially more up-front modeling and much more work in figuring out where the variability is in new manufacturing processes. It also means potentially more restrictive design rules and less creativity at the very front end of Moore’s Law.

Different approaches

“At 45nm and 32nm, you can’t put a margin on everything because your performance would go to zero,” said Rob Aitken, a research fellow at ARM. “For the relationship between design and low power, there are two approaches being advocated. One is to do a better job quantifying the margins. Instead of putting a finger in the air and saying, ‘Let’s worst case this and worst case that,’ the solution is more, ‘Let’s actually look at data and figure out where the worst cases lie, look for correlations and relationships between the amount of timing slack we have and our verification extraction methodology. Maybe we can use a better extraction technique and shave off some of that margin.’”

A second approach is a more adaptive one, where you know there will be some margins but you don’t know exactly what they are. “When you get your silicon you have adjustable parameters, whether they’re voltage or clock frequency or something else, that you can tune on a per-chip basis to boost up yield and achieve margin without necessarily putting it in the design,” Aitken said.
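As a rough illustration of per-chip tuning, the sketch below assumes that post-silicon characterization gives, for each chip, the maximum passing clock frequency at a few supply voltages. It then picks the lowest voltage that still meets the target frequency. The data, units, and function name are all made up for the example.

```python
from typing import Dict, Optional


def pick_voltage(fmax_by_voltage: Dict[float, float],
                 target_mhz: float) -> Optional[float]:
    """Return the lowest voltage whose measured fmax meets the target, else None."""
    passing = [v for v, fmax in fmax_by_voltage.items() if fmax >= target_mhz]
    return min(passing) if passing else None


if __name__ == "__main__":
    # Hypothetical characterization data (volts -> MHz) for two chips
    # built from the same design but landing at different process corners.
    chips = {
        "fast_chip": {0.9: 520.0, 1.0: 610.0, 1.1: 690.0},
        "slow_chip": {0.9: 430.0, 1.0: 505.0, 1.1: 580.0},
    }
    for name, data in chips.items():
        print(name, "->", pick_voltage(data, target_mhz=500.0), "V")
```

In this toy data the faster chip can run at the lowest voltage while the slower one needs a step up, so neither has to carry a worst-case margin baked into the design.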

There are other approaches being advocated, as well. Bhanu Kapoor, founder of Mimasic, a consultancy in Richardson, Texas, said building work-arounds into chips such as classic fault tolerance is an acceptable option.

“We need to start learning to live with errors,” Kapoor said. “Margin-related issues will lead to errors and they will not function correctly at times. That’s where you have to bring in techniques like fault tolerance, where you have error correction. That is a very useful technique for low power, too, because you can work at lower voltages. There will be times when your critical path timing will not be met and you will have errors. Then you try to detect the errors, correct them and learn to live with them.”
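Kapoor does not name a specific scheme. Purely as a familiar illustration of "detect and correct," the sketch below uses triple modular redundancy: three copies of a value are computed, a mismatch signals an error, and a bitwise majority vote recovers the correct value when only one copy is wrong. The function names are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of a value."""
    return (a & b) | (a & c) | (b & c)


def error_detected(a: int, b: int, c: int) -> bool:
    """True if the three copies do not all agree."""
    return not (a == b == c)


if __name__ == "__main__":
    good = 0b1011_0010
    flipped = good ^ 0b0000_1000  # one copy suffers a single bit flip
    print("error detected:", error_detected(good, flipped, good))
    print("corrected:", majority_vote(good, flipped, good) == good)
```

The redundancy has its own area and power cost, which is the overhead side of the trade-off discussed in this article.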

Still others say there should be no workarounds. Vinay Srinivas, group director for R&D at Synopsys, said the solution is eliminating variability up front so there is less need for margins and far fewer errors.

“You need better tools, modeling and methodology,” Srinivas said. “Having these guardbands is not acceptable. If you were to guardband everything when the system wakes up you would have so much latency that you couldn’t afford it in the design. At 45nm and 32nm, you need more voltage-aware modeling.”

What works?

While companies such as Synopsys are pushing for better designs up front, the majority of designs will still include some design margins—at least in the short term. Hamid Mahmoodi, assistant professor of electrical and computer engineering at San Francisco State University’s School of Engineering, said there are times when each approach works.

“There is a lot of variability and unpredictability in designs,” Mahmoodi said. “Adding margins is the easiest way to solve that. You can make the design faster than expected by adding in additional biasing or something to cope with the variation in processes. But adding margin means more silicon area and more power. There is cost in terms of additional sensors or voltage regulators. Even corrective action requires overhead.”

Sometimes, in fact, adding margin can be the most cost-effective solution.

“In a given process, which is more cost-effective depends,” Mahmoodi said. “If the variability is small, adding margins is the most cost-effective solution. When the variability is large, and there are variations in process parameters and voltage, then adding margins is too expensive. At that point, it’s best to consider fault tolerance schemes or adaptive asset calibration methods to make the design more reliable.”

Conclusion

The bottom line is that even the experts disagree on what route to take when. That largely will be up to the design teams working under intense deadlines to get their chips out the door. But at each new process node, there clearly is less room for adding margins and more restrictive design rules for getting chips to yield properly and perform as planned within power limits defined by customers. And if you think it’s hard at 45nm, it’s only going to get more difficult over the next couple of nodes.