Posts Tagged ‘Software’

Experts At The Table: Making Software More Energy-Efficient

Friday, January 27th, 2012

By Ed Sperling
Low-Power Engineering sat down to discuss software and power with Adam Kaiser, Nucleus RTOS architect at Mentor Graphics; Pete Hardee, marketing director at Cadence; Chris Rowen, CTO of Tensilica; Vic Kulkarni, senior vice president and general manager of Apache Design, and Bill Neifert, CTO of Carbon Design Systems. What follows are excerpts of that conversation.

LPE: How much of the battery drain on a smart phone is caused by the hardware, how much is caused by the software, and how much is caused by bad reception?
Kaiser: Software controls a lot of it. Bad hardware that does not allow you to turn something off is one cause. But that doesn’t happen as often as bad software. If the hardware has one clock that turns everything off then you have a problem because whenever you want to use one little block you have to turn on five. But with software you have to give engineers feedback and tell them what knobs to turn. Ideally, you even give them an algorithm for how to tweak those knobs. We tried to do this with Nucleus. The drivers automatically manage their own power for WiFi or anything else. If no one opens the driver it won’t burn power. If you can lower power, don’t worry about the rest of the OS. Just minimize dynamically. You can set up limits for the driver. Then the application guy just needs to be able to allow the device to turn on. You need to give people simple metrics like CPU utilization. And if you give metrics on how much power your CPU is using while idle and how much it’s using when it’s busy, you can tell how much your CPU is using. Then, if you lower the frequency to half and the CPU is twice as busy, it’s actually burning more power. The compiler needs to do the job.
Rowen: The compiler can do a good job of the lower level things, but the choice of algorithms and which states you’re going to transition among is way beyond what the compiler has any access to. I recently saw a study of the number of states that a cell phone goes through. Something like 38 messages had to go back and forth between the software running on the phone and what was going on in the base station that were basically a negotiation as the phone entered a cell. There are some very tough and complex tradeoffs to make about whether you want to save power at one level by doing fewer transactions or you want to be aggressive and get the negotiation done as quickly as possible because it allows you to get into the lower power state as quickly as possible. There are some non-obvious tradeoffs at work at the system level because you have to determine if the phone is in a low-power or high-power state. They’re not things that you’re going to work out between Microsoft and Nokia. It’s going to be between Nokia and AT&T.
Kaiser: Does it matter? How often do you associate with a particular cell station? It affects standby time, but standby time is already pretty long. Does it really matter if you optimize that case, or do you care about other cases? How much of your battery went into this handshake?
Rowen: With the scenarios I’ve seen it could matter a lot.
Hardee: If you change the data arrival rate to those processes that are rendering Web pages, it’s a big difference. You could be running your graphics processors continually just because you have a slow data arrival rate, as opposed to processing everything and shutting down. It would be difficult for the software guys to optimize for those cases. What they can optimize for is how predictable stuff is. Can you do predictive scheduling? That changes what the application is doing. Those decisions are set pretty low down in the software stack, but what’s available to use and how effectively it can be used is another thing the software engineer has to think about.
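Kaiser's suggestion above, that idle power, busy power and CPU utilization are enough to tell how much the CPU is using, can be sketched as a time-weighted average. This is only an illustration: the function name and all milliwatt figures are hypothetical, and the half-clock operating point assumes frequency is lowered without a matching voltage reduction, so busy power does not drop by half.

```python
def average_power_mw(utilization, busy_mw, idle_mw):
    """Time-weighted average of busy and idle power for one CPU."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return utilization * busy_mw + (1.0 - utilization) * idle_mw


# Hypothetical operating points (not measured data).
FULL_CLOCK = {"busy_mw": 400.0, "idle_mw": 20.0}
# Assumption: halving the clock without lowering voltage leaves a large
# fixed component, so busy power falls to 250 mW rather than 200 mW.
HALF_CLOCK = {"busy_mw": 250.0, "idle_mw": 15.0}

# The same workload keeps the CPU 30% busy at full clock, 60% at half clock.
full = average_power_mw(0.30, **FULL_CLOCK)
half = average_power_mw(0.60, **HALF_CLOCK)

print(f"full clock, 30% busy: {full:.1f} mW")
print(f"half clock, 60% busy: {half:.1f} mW")
```

With these assumed numbers the half-clock case comes out higher, which is the situation Kaiser describes. Different idle and busy figures can reverse the result, which is why the metrics have to come from the actual platform.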

LPE: How much of this information is making its way between hardware and software teams?
Kulkarni: That’s where virtual platforms come in. A co-simulation platform is a better description. But the marriage of the software with the hardware and how we capture that in instrumentation then can be driven toward a meter, which may be RTL power, a hardware description. But it all has to convert into power analysis at the end of the day. The feedback can be given to the system designer and the software designer, but all those things are missing. What Carbon is doing is an important step toward that. You can do the power analysis and get that feedback. We have to look at the application over time, and the feedback has to be in real time. In one of our customer applications for digital TV, they asked us if your eyes are looking at the oval in the middle of the screen can you turn off the power at the edges. They’re looking at pixel-by-pixel power control. This is real-time feedback of hardware and software applications.
Kaiser: You can re-encode movies based upon brightness. If it’s pretty dark, you can show it with much lower backlight. The backlight can vary and the screen looks the same. And it can vary by region. That’s beyond the scope of hardware. It’s algorithms.
Kulkarni: This customer is looking for software energy-reducing concepts. They want to know where their software is consuming more power.
Kaiser: They want the drivers. And if you’re going to be varying the CPU, then you also need to provide the compiler.
Rowen: Depending on what level in the system you’re talking about, the hardware has always provided the software. We’re doing a lot of advanced baseband design. The next thing after the industry specification that you do is make it happen in 150 milliwatts at 300 Mbits per second. That drives all the subsequent design, including the choice of algorithms, the processors, the allocation of memory and the interconnect. They’re all driven within a power budget. Everyone working at layer one knows the power. This very tight hardware-software co-design is very established there. It starts to loosen up as you go up, in part because you’re aggregating these much more complex systems together.
Neifert: That’s where it’s missing. The power is really a system context. Five or six years ago I started getting inquiries from leading-edge customers. A couple years later it was leading-edge research groups. About two years ago it made it out of research, and now about 30% or 40% of our customers are doing this in some way. It’s of great importance now.
Hardee: We all tend to gravitate toward the simulation model or the virtual platform’s ability to do power estimation. That’s not actually the low-hanging fruit, though. The thing that can be done relatively simply is system integration testing of power management software. Can you switch the mains on and off? Is it idle when you think it’s idle? That’s a lot lower-hanging fruit in a SystemC TLM 2.0 modeling environment than in power estimation. For power estimation, we have a ways to go even in the activity formats used. You have to use averaging formats over defined windows. These all apply at the signal level. How do we bring them up to the TLM 2.0 level to make them run faster? That can be an issue. There are circumstances where you can say you have an AXI protocol and 64 bits, and you can do the math to get from signal level to architectural level. But then you look at all the architectural differences that start to become nuances in that model, like whether you’re doing split transactions and how bus transactions are being pipelined. Is that being correctly modeled in the platform? There’s a lot of complication. Even to get relative accuracy you will need to model this.
Rowen: We’ve gone up halfway between this signal and toggle level and TLM. Processors are nicely defined. What we’ve done is to automatically derive instruction-execution-level energy models so we can, as part of the initial instruction set characterization, come up with a pretty good energy model per execution. It’s still data independent, but there’s a summary number. The simulator knows how to count things like memory references. Then the whole processor plus memory subsystem has very accurate relative and kind of accurate absolute energy at a level that runs at the full speed of a fast simulator, not at RTL speed. Therefore you can start to make that a building block within a transaction-level approach. That’s one of the pieces of raising energy in abstraction and getting past the toggle.
Neifert: You start doing toggles and you slow everything down. You may use the toggles as an instrument for calibration, and then you go back and put that in and say, when I do this I take this much power per cycle. Then you can start aggregating some of those numbers to at least get a relative figure.
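The approach Rowen and Neifert describe, calibrating a per-instruction energy figure once and then letting a fast simulator simply count events, might look roughly like the sketch below. The operation classes, the picojoule values and the function name are all hypothetical; a real table would come from characterization against toggle-level data.

```python
from collections import Counter

# Hypothetical calibration table: energy per executed operation, in picojoules.
# These values are placeholders, not characterization results.
ENERGY_PJ = {
    "alu": 10.0,
    "mul": 25.0,
    "load": 40.0,
    "store": 45.0,
    "branch": 12.0,
}


def estimate_energy_pj(trace, table=ENERGY_PJ):
    """Sum calibrated per-operation energy over an instruction trace.

    The model is data independent: it only counts how often each
    operation class executes, as a fast simulator could.
    """
    counts = Counter(trace)
    unknown = set(counts) - set(table)
    if unknown:
        raise KeyError(f"no calibration data for: {sorted(unknown)}")
    return sum(table[op] * n for op, n in counts.items())


# Two hypothetical inner loops; the second makes one extra memory reference.
trace_a = ["load", "mul", "alu", "store"] * 1000
trace_b = ["load", "load", "mul", "alu", "store"] * 1000

print(estimate_energy_pj(trace_a), estimate_energy_pj(trace_b))
```

Even if the absolute numbers are off, the ranking of the two loops is the kind of relative figure Neifert refers to.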

When Worlds Collide: Saving Power In Communications Applications

Thursday, January 12th, 2012

By Ann Steffora Mutschler
The interplay of hardware and software is a given in every device that contains a semiconductor chip, but is typically felt more acutely in communications applications given the extremely close dependencies for everything power-related. Managing power in these situations just gets more challenging as consumers demand more and better applications on their tablets, smartphones and other mobile devices.

Power always needs to be addressed at many levels simultaneously because there are both raw technology factors, such as the semiconductor technology, and system issues, such as what algorithms you run and what collaboration takes place between handset and base station, stressed Chris Rowen, CTO of Tensilica. “All of this can have significant effects on what the total energy consumption is of the system. As you go up in the level of abstraction you move away from the individual transistors and talk about what the system behavior is, so you can get larger and larger relative savings.”

This is because if the system can be organized such that no communication is required, or one bit of information tells the whole story, then gigabytes of information may not have to be moved across the network, he said. The problem is that the engineering team must know exactly what single bit tells the story. “There are many things that people do in finding ways to store data instead of communicate data, to encode data more cleverly, to make data communication more resilient so that they can avoid doing work, avoid doing communication processing and therefore save lots of power because the radio never has to go on, or has to go on very infrequently and the encoding can be greatly simplified.”

“At the other end of the spectrum,” Rowen continued, “there are many things that you can do at the level of the circuit design, the logic design, the processing architecture, which can significantly reduce the power as well, even once you accept that certain standards and certain communications protocols can be used and the intelligent chip architect or system architect is aware of when they have to live within the ground rules of the standard that they are implementing.”

Techniques for saving power
From a software perspective, power-saving techniques are being driven by emerging new architectures such as ARM’s big.LITTLE, in which a companion CPU is able to take over the system when performance requirements are low and energy efficiency matters most, while faster CPUs handle the high-performance requirements, said Achim Nohl, technical marketing manager for Synopsys’ solutions group. Within this approach is the new concept of switching, at a very coarse-grained level, from a high-performance, high-energy CPU to a low-performance, low-energy CPU.

“At the same time,” he said, “there is an orthogonal technique for power saving—dynamic voltage and frequency scaling (DVFS) where in parallel to big.LITTLE you are able to scale down the frequency of a cluster or of a single CPU. That can only be done by predicting what the performance requirements are for the specific workload to perform a just-in-time completion of a specific task. I need to know how much processing power will be required in order to satisfy this task so that it can be computed just-in-time.”

There’s a lot of impact on the software and on the whole workload prediction. Schedulers must become power-aware. There is also a contrasting scheme called “race to idle,” where rather than scaling voltage and frequency you run as fast as possible and then remain in idle mode as long as possible. But these solutions are hard to evaluate against each other because they are highly scenario-dependent. Scenario means the software and the whole user scenario, Nohl said.
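To make the scenario dependence concrete, here is a minimal sketch that compares just-in-time DVFS with race to idle for a single task and deadline. The cycle count, frequencies and power figures are hypothetical, and the model ignores transition latency and wake-up energy, which a real evaluation would have to include.

```python
def energy_mj(cycles, freq_hz, active_mw, idle_mw, deadline_s):
    """Energy in millijoules to finish `cycles` of work before the deadline,
    idling for whatever time is left over (mW x s = mJ)."""
    busy_s = cycles / freq_hz
    if busy_s > deadline_s:
        raise ValueError("task misses its deadline at this frequency")
    return active_mw * busy_s + idle_mw * (deadline_s - busy_s)


CYCLES = 50e6      # hypothetical workload
DEADLINE_S = 0.1   # hypothetical deadline

# Scenario 1 (assumed numbers): voltage scales well, idle state is shallow.
race_1 = energy_mj(CYCLES, 1.0e9, active_mw=800.0, idle_mw=5.0, deadline_s=DEADLINE_S)
dvfs_1 = energy_mj(CYCLES, 0.5e9, active_mw=300.0, idle_mw=5.0, deadline_s=DEADLINE_S)

# Scenario 2 (assumed numbers): leakage dominates at low speed, idle is very deep.
race_2 = energy_mj(CYCLES, 1.0e9, active_mw=800.0, idle_mw=1.0, deadline_s=DEADLINE_S)
dvfs_2 = energy_mj(CYCLES, 0.5e9, active_mw=450.0, idle_mw=1.0, deadline_s=DEADLINE_S)

print(f"scenario 1: race {race_1:.2f} mJ, DVFS {dvfs_1:.2f} mJ")
print(f"scenario 2: race {race_2:.2f} mJ, DVFS {dvfs_2:.2f} mJ")
```

Under the first set of assumptions DVFS uses less energy, and under the second race to idle does, which illustrates why neither scheme can be declared the winner without knowing the scenario.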

Rowen pointed out that a hardware technique for power savings that is gaining steam is the more careful adaptation of the processor engines to fit the tasks, because there are very distinctive things that you do in some parts of the receiver such as FFTs, while in other parts there is a lot of filtering, and in other parts there is forward error correction, which is a successive approximation method for determining what the signal was.

For those at the sharp end of silicon platforms for mobile devices, Pete Hardee, director of solutions marketing at Cadence observed that semiconductor and systems companies are seeking all the power saving techniques they can get. “This is where people have been using the regular techniques like power shut-off. We’re going to see, as well as power shut-off, a lot more use of DVFS – that’s certainly going to be seen a lot more as people struggle with power.”

One interesting technique as designs go from node to node is substrate biasing, which has been used effectively at earlier nodes like 90nm. “Once you get under 90nm there is a lot of debate as to whether or not substrate biasing is an effective technique. It is applying a negative bias to the body of the silicon, which reduces leakage especially when you’re at near-threshold voltages. We see substrate biasing being used even at very deep submicron, especially in relation to standby modes of memories that reduce leakage. [But] there is a lot of debate on the effectiveness of substrate biasing beyond standby mode of memories and the reason is there’s a routing issue with all of these bias signals. Per transistor, you’re effectively supplying a bias supply to each transistor of the chip so that gets very expensive from the power routing point of view and you start to hit routing congestion and so on. We see people using substrate biasing at the 40nm node and then it gets a lot fewer at 28nm and people are starting to wonder if it’s going to be effective at 20nm (22nm for Intel). One thing that we’re figuring out is that finFET probably obviates the need for substrate biasing. You’ve got a lot better control of leakage due to the topology of the gates through the 3D construction of the finFET transistors. When finFET becomes the norm, we think we’ll see the end of substrate biasing as a technique.”

Intersection of low power and test
While test is not so much an area where much can be done at this point to help save power, it is nonetheless an important part of the design process with unique issues, noted Greg Aldrich, director of marketing for the Silicon Test Systems group at Mentor Graphics. “The two aspects that we have to deal with for low power are first, when we are inserting structures into the design, are we consistent with all of the power intent or are we respecting all of the power intent, the power island and what is required for the low power design when we are inserting logic? Are we making sure that, for example, when we are inserting compression logic into a design that may have three or four different power islands or power partitions that we’re not crossing those power partitions with the test logic that we insert or that we’re properly isolating those power partitions? If there’s a constraint in the low-power design where I can’t power up all of the three partitions at the same time, I need to be able to test it in that manner, as well.”

The second and, he said, maybe more concerning issue is that when testing a low-power design the tests cannot overstress the power design. “If it’s a low-power design, typically that means that there is very little switching activity, a lot of the design is turned off and the power rails, the power system is designed that way. When you do testing typically you want to get as much activity as quickly as possible so that you can test the device as quickly as possible. You’re testing every single device you manufacture, so every second you spend testing that device costs you more money in the manufacturing cycle.”

In test the objective has always been to get as much activity in the circuit as possible in order to test it as fast as possible, but for low-power designs that approach could damage the device.

At the end of the day, the biggest problem in looking to save power in communications applications is that neither side can solve it alone, according to Marc Serughetti, director of product marketing for virtual prototyping at Synopsys. “It’s not about hardware, it’s not about software. It’s about the two together, and when it comes to software it’s not about the low-level software either, it’s about the entire software stack because a simple application can create a significant problem when it comes to power consumption. Now you are talking two different worlds once again colliding, and if you approach this purely from a hardware perspective you are going to end up in a situation that may sound interesting for the hardware people, but when it comes to the software world where you need to be able to run Android or Windows Mobile, the performance of the environment you need to use is a significant component that must be analyzed.”

Experts At The Table: Making Software More Energy-Efficient

Thursday, January 12th, 2012

By Ed Sperling
Low-Power Engineering sat down to discuss software and power with Adam Kaiser, Nucleus RTOS architect at Mentor Graphics; Pete Hardee, marketing director at Cadence; Chris Rowen, CTO of Tensilica; Vic Kulkarni, senior vice president and general manager of Apache Design, and Bill Neifert, CTO of Carbon Design Systems. What follows are excerpts of that conversation.

LPE: Software is causing as many problems in low-power designs as anything else. How do we fix that?
Neifert: Software is creating problems everywhere because software is driving everything. Customers are using software for more and more stuff, including software-driven verification and architectural analysis. It’s only natural to take traditional back-end tasks such as power and move those forward to enable that analysis to be done earlier in the process and see how the software will impact the system. As software increasingly dominates power usage, because the processor and IP guys are getting smarter and smarter about turning things off when they’re not being used, you can’t just blindly use some verification vectors. You need to really see how it’s being used by the software. That requires virtual prototypes in conjunction with power tools.
Hardee: The software is controlling everything, but in recent years we’ve seen an explosion of applications doing all kinds of things you want to do on mobile devices. One thing we see people struggling with is the best way of optimizing for power for different applications, which have different needs. If you’re watching video, you have a frame rate to deal with. Rather than shut down, you’ll probably want to optimize framework lengths or run as slow as possible and make sure memory transactions are being accessed from cache. That’s completely different when you’re on the Web, where you want to run as fast as possible to get a good image and shut down. What you’re actually doing with a device will require very different power strategies.

LPE: So how do you analyze that?
Hardee: You have to run a lot of cycles in the various system modes to be able to model it. That’s where it really gets very disconnected from the test vectors in logic simulation. You can’t use those anymore. You have to have a set of vectors that is representative of the various application modes you have on the device. That’s a big change for a lot of people.
Kaiser: If you’re going to have software guys doing power optimization, you need to get them some kind of metric to measure or estimate the power. If you go to a software engineer today and ask them to optimize power, how do they know if they’re doing better or worse? Giving them a power meter helps. Giving them a power meter that self calibrates helps a lot more because the software guy really doesn’t know what calibration is for A-to-Ds. He sees negative current and figures he must be doing something wrong because he’s charging the battery. First give these engineers something to measure. Second, when you’re doing a product you have to identify all these different use cases. Internet surfing is one. Video playback is another. Software teams need to optimize individual use cases. But to do that, you have to figure out how much you’re going to be doing of what. A cell phone is sitting in idle 99% of the time, but 99% of the energy is not used on standby. So you have to look at the different use cases and figure out which use case you really want to optimize.
Kulkarni: What we’re really concerned with is energy consumption over time. Instantaneous power, dynamic power and static power are well known, but energy consumption over time is where software enters the picture and turns it into system-driven power consumption analysis. If you look at power times time, it’s the clock frequency and the duration of the cycle. That’s where most of the applications are causing all these headaches. Why does a GPS application versus YouTube versus music have so many energy profiles? We can control instantaneous power and dynamic power quite well at RTL and below. Static power can be controlled quite well. But testbenches that look at the overall functional verification are not relevant anymore because you need to look at the states, too. And depending on which application is used, the energy will be different. We need to instrument that with respect to the power states. How you create the right test vector set or testbenches becomes a real challenge. Then, looking at average power and average voltage versus average current over time. That’s where the issue of co-simulation comes in. Running the complete simulation on a virtual platform becomes more interesting. At the moment, instrumenting the software is not possible.
Rowen: This is a serious problem. One problem is that the software guy has a simplified view of all the clever things the hardware guy did to make it possible to reduce the power. The hardware guys are obsessed with power these days. Not all of what they come up with is good or practical, but they are at least thinking about it. They have power modes, techniques for certain operations or instructions to meet power requirements. That may be pretty far removed from what the software guy, who’s on the front lines of getting a product out the door, has visibility into. It’s made worse by the fact that most of the tools people have work well in a simulation environment, but often what happens at the end of the day is you have a programmer, a prototype board and an ammeter. It’s a very crude picture of what’s going on. The poor software guy is trying to figure out what he can do overnight that will move the ammeter.
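Kaiser's observation that a phone idles 99% of the time without spending 99% of its energy there, and Kulkarni's emphasis on energy over time, both come down to weighting each use case's average power by how long it runs. The sketch below shows that bookkeeping; the use cases, hours and milliwatt figures are hypothetical.

```python
# Hypothetical daily profile: use case -> (hours per day, average power in mW).
USE_CASES = {
    "standby": (23.76, 10.0),   # 99% of a 24-hour day
    "video": (0.10, 1200.0),
    "web": (0.14, 900.0),
}


def energy_shares(use_cases):
    """Return {use case: (share of time, share of energy)}."""
    total_hours = sum(hours for hours, _ in use_cases.values())
    total_mwh = sum(hours * mw for hours, mw in use_cases.values())
    return {
        name: (hours / total_hours, hours * mw / total_mwh)
        for name, (hours, mw) in use_cases.items()
    }


shares = energy_shares(USE_CASES)
for name, (time_share, energy_share) in shares.items():
    print(f"{name:8s} time {time_share:6.1%}  energy {energy_share:6.1%}")
```

With these assumed figures standby takes 99% of the time but less than half of the energy, so the short, high-power use cases are the ones worth optimizing first.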

LPE: We have no standard gauge for software. It varies by application, middleware and operating system as well as by the usage from one person to the next. How do we deal with this?
Hardee: If you look at the need for system vectors that reflect the application you’re running, that becomes a problem when you’re trying to provide a power meter for the software engineers. That works well if you can run actual software against the target device, which is the only chance where you can run it fast and get accurate readings. You need accurate vectors and you need accurate characterization to get any sense of power or energy. The difference between power and energy is you need to know over time what the system is doing and model that correctly. Virtual platforms have some potential to help, but they’re problematic. If you’ve got the kind of virtual platform that runs fast enough to make a software engineer happy, you’ll be modeling that at an untimed level. At the untimed level the virtual platform is instruction accurate, so it’s getting its timing and instruction cycles from the processor. If you think about what’s going on with software and the choices to process things, can you do it from cache or do you have to go out and fetch it from some other level of memory. Those have a huge impact on the number of clock cycles, rather than instruction cycles, that it takes to perform those tasks. So the point is that you need a lot of timing accuracy before you can get any kind of energy accuracy. That’s difficult to build into a virtual platform.
Kaiser: You don’t need the actual numbers. You just need to know if it’s getting better or worse. You can give the software team a relative number. Second, you can start doing estimations. With an MP3 you want to know what your cache-miss ratio is.
Hardee: You need to start to measure the things that you know drive high energy usage, as opposed to measuring the energy usage itself. When you have 400% or 500% difference between cache and memory, it’s hard to put different algorithms in the right order. You don’t even have the relative accuracy you’re looking for.
Kaiser: So are you looking at platform-to-platform comparisons? I’m thinking you take the platform and get the software guy to make it as best as it can be.
Hardee: You’re coming at it from the standpoint of post-hardware. How does the software guy optimize his programs?
Kaiser: Or even if you don’t have the physical hardware yet.
Hardee: I’m approaching it from the view of the system architect designing a new system. How do they know they’re going to meet the power spec? If you’re rendering graphics versus video, you have to be running the right algorithm on the right core. There are multiple choices, and you have to figure this out even before you measure things relatively, let alone absolutely.
Kaiser: That’s a system architect challenge, not a software challenge. The most the system guy can do is identify the best possible scenario. The software guy may or may not come close. Sometimes it may not even be possible.
Hardee: And you can really mess up the software. The system architect does have to make an assumption that what he’s building into the system architecture will be used efficiently by the software guys.
Rowen: You really want someone who has a deep understanding of what instructions to use, what compiler flags and power modes should be used, and what is the realistic scenario that will contribute to the worst case. In general, things are better when more things are programmable. The worst thing is where the controls are inside some obscure, hard-wired function unit. We had a big customer recently that had trouble meeting power goals. The fact that they were using programmable audio made it a lot easier to come up with another way to buffer the data and initiate the applications.
Neifert: When you have the chip, at least you have an ammeter sitting there. Before you have the chip, the software guy is in the dark. He often doesn’t have any indication of what’s going on in the system from a power perspective because there isn’t much there to tell him that.
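Hardee's earlier point, that cache behavior changes the number of clock cycles far more than the number of instructions, can be illustrated with a very simple cycle-count model. The one-cycle hit, 100-cycle miss and workload sizes are hypothetical placeholders, and the model assumes every non-memory instruction takes one cycle.

```python
def clock_cycles(instructions, mem_refs, miss_ratio,
                 hit_cycles=1, miss_cycles=100):
    """Rough clock-cycle count for a workload (illustrative model only).

    Assumes one cycle per non-memory instruction; memory references cost
    `hit_cycles` on a cache hit and `miss_cycles` on a miss.
    """
    if mem_refs > instructions:
        raise ValueError("memory references cannot exceed instruction count")
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss_ratio must be between 0 and 1")
    misses = mem_refs * miss_ratio
    hits = mem_refs - misses
    return (instructions - mem_refs) + hits * hit_cycles + misses * miss_cycles


# Same instruction count, two hypothetical cache-miss ratios.
low_miss = clock_cycles(1_000_000, 300_000, miss_ratio=0.02)
high_miss = clock_cycles(1_000_000, 300_000, miss_ratio=0.10)

print(f"2% misses:  {low_miss:,.0f} cycles")
print(f"10% misses: {high_miss:,.0f} cycles")
```

An instruction-accurate platform would report the same one million instructions in both cases, which is why Kaiser's cache-miss ratio is a useful proxy metric even without absolute energy numbers.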

Making Software Better

Wednesday, January 11th, 2012

Low-Power Engineering talks about what will make software more energy-efficient with Pete Hardee, marketing director at Cadence; Adam Kaiser, Nucleus RTOS architect at Mentor Graphics; Chris Rowen, CTO of Tensilica; Vic Kulkarni, senior VP and General Manager of Apache Design, and Bill Neifert, CTO of Carbon Design.

Getting The Balance Right

Thursday, August 11th, 2011

By Ann Steffora Mutschler
Defining the power architecture for a low-power design means striking a balance between the high-level abstraction and measurements made typically at RTL and below, but today that is easier said than done.

“The balance is that at the high level of abstraction, the design choices you make have a big effect over power, yet your ability to measure them is incomplete until you get much further down the design flow. That’s a balance that people have to strike and it tends to be a problem,” said Pete Hardee, director of solutions marketing at Cadence.

What works best at the high level of abstraction is the ability to run real system modes and get real activity vectors, which are becoming increasingly important. It’s actually better to take that information at the earlier abstraction level when a lot of data can be run. Software can be run on a virtual platform or an emulation box, which provide activity data. “It’s important to understand the modes because of all the complexity—the different power modes that a system is in relating to all the different system modes that need to be covered,” Hardee said.

The other piece of the equation is the characterization of every time there is activity, every time switching occurs, and what that means at the device level in terms of power. The problem is that characterization often isn’t available until later on in the design process.

“RTL is a good place where those come together,” Hardee noted. “Above RTL, we’re often guessing at that. If we can get to at least a relative ranking of the various architecture changes you have in mind, then you’re doing really well. And that’s all the above-RTL or system-level guys are trying to do at that stage.”

Fortunately, derivative designs allow you to get a little bit better than that, because if a similar platform has been done, there is probably some good characterization data from a previous design so that can be used formally or informally.

For any level of abstraction, the most important thing is to understand the limitations of the model, said Cary Chin, director of technical marketing for low power solutions at Synopsys. “Models that are used as intended can be quite accurate, but accuracy tends to drop off quickly if the assumptions are not met. For example, a high-level model for computing dynamic power based on transition frequency might be very accurate when a block is in normal operating mode, but in some special power saving mode the assumptions might need to be specially validated or the model adjusted or extended.”

Exactly what are we measuring?
“When you are measuring power, you are doing two different calculations–in a certain amount of time, with a certain amount of load, how many transistors are flipping on and off. Each time you do that is the act of power. It’s a very interesting problem to solve because nowadays it’s not just performance. [It’s about] how do you do it at a high level so you can get an architecture before you go down to the details. You don’t want to try it and see,” said Kurt Shuler, director of marketing at Arteris.

Models are the way to go from the high-level, and are typically validated against simulation at the lower level, Synopsys’ Chin said. So a block-level IP power model could be checked against a gate level analysis to verify correctness in multiple modes of operation. Similarly, gate models are validated against circuit simulation, and so on. “At each level, it’s important that the validation be as exhaustive as possible (including some measure of completeness) in order to build confidence at the higher levels of abstraction,” he said.

This data is generally available, but the model accuracy varies as the model is tuned. “Determining an accurate and compact set of parameters for any model is the ultimate goal, but that’s easier said than done. We learn by experience, applying new information to refine successive versions of the model to achieve better accuracy over time. The usual tradeoffs apply—time vs. space vs. accuracy,” Chin observed.

Captured within those models are dynamic and leakage power.

“It used to be that you needed activity to measure the dynamic power and leakage power,” Cadence’s Hardee said. “What’s changed is that now we have leakage increasing in today’s advanced nodes, and that has led to techniques specifically to control leakage like power shutoff. You’ve got to remember that the leakage calculation depends on the system modes and how long the blocks are shut off for, and that has to be factored in.”

That can be done at a number of levels, by running system software either on a prototype or on a previous version of the chip if it is a derivative. What you are looking for are typical usage scenarios, such as how long you are in each of the identified high-level system modes, and what’s on and what’s off. From that you can create profiles, which in turn can be used to estimate dynamic power and to account for leakage power.
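As a rough illustration of the profiling step described above, the sketch below weights the power of each system mode by the fraction of time spent in that mode. The mode names and all milliwatt figures are invented for the example and do not come from any real design or from the vendors quoted here.

```python
# Hypothetical usage profile: fraction of time spent in each system mode.
# Mode names and all numbers are illustrative assumptions, not measured data.
profile = {"active": 0.10, "idle": 0.30, "shutoff": 0.60}

# Assumed per-mode power in milliwatts, as (dynamic, leakage) pairs.
mode_power_mw = {
    "active": (400.0, 20.0),
    "idle": (50.0, 20.0),
    "shutoff": (0.0, 1.0),  # block is power-gated, so only residual leakage
}


def average_power_mw(profile, mode_power_mw):
    """Return (dynamic, leakage) power averaged over the usage profile."""
    if abs(sum(profile.values()) - 1.0) > 1e-9:
        raise ValueError("mode fractions must sum to 1")
    dynamic = sum(frac * mode_power_mw[mode][0] for mode, frac in profile.items())
    leakage = sum(frac * mode_power_mw[mode][1] for mode, frac in profile.items())
    return dynamic, leakage


dynamic, leakage = average_power_mw(profile, mode_power_mw)
print(f"dynamic {dynamic:.1f} mW, leakage {leakage:.1f} mW")
```

Changing the time spent in the shutoff mode changes the leakage term directly, which is the dependency on system modes that Hardee describes.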

The software perspective
Considering power consumption from the software point of view, Marc Serughetti, director of product marketing for virtual prototyping at Synopsys, noted that open software platforms such as Android have unlocked smart phones to a worldwide community of open source software developers.

“While users clearly benefit, what is the impact on power and battery life?” said Serughetti. “Power efficiency is becoming a key issue for software developers, and an important quality criterion for their software. This impacts all the layers in the software stack. All layers need to be well integrated from a power management perspective and all functional entities contained in these layers need to cooperate. The big challenge for software engineers is getting insight into how well the system is performing from a power perspective.”

Here, virtual prototypes are useful as they provide a means to access such information as long as the information is available from the virtual prototype model. To be sure, advanced low-power techniques will soon be ubiquitous not just in mobile designs but in all designs: consumer electronics, data centers, and many other areas. Once stable they are expected to be widely available.

Customer Perspective: STMicroelectronics

Thursday, August 11th, 2011

By Ed Sperling
Philippe Magarshack, group vice president for technology R&D at STMicroelectronics, sat down with Low-Power Engineering to talk about some of the fundamental changes ahead in how SoCs are designed, built, how they perform and what steps can be taken to speed time to market.

LPE: What do you see as the biggest changes ahead?
Magarshack: One is the sheer size of the ecosystem and the relationships you will need to have moving forward. We are already dealing with this at 28nm. At 20nm, we see this as a tremendous challenge. We are not alone. The competition has to deal with this, as well. We have the size to justify moving to 20nm, and 14nm in the future. We also have a network of foundries, not only for manufacturing but also for process R&D, and we are able to do some of that process R&D internally. When you look at the holistic cocktail of components needed to move forward, this is very challenging. But we are also one of two players that can actually take advantage of that.

LPE: Who’s the other one?
Magarshack: Intel. Even though they are coming from the high-end microprocessor market, they are certainly very serious about moving toward systems on chip and lower power. They aren’t there in terms of low power, but they are very focused. There is also R&D among big established players.

LPE: Where does stacking of die fit in?
Magarshack: There is a lot of buzz about stacking of die and TSVs. We are not quite there. But the trend toward system-in-package, which may include 3D stacking or side-by-side or some other combination, is very strong. We are using that for our set-top box and digital products. We are concentrating the pure digital design on 28nm and using to our advantage lower cost and more efficient processes for all of the analog systems.

LPE: What node is the analog in?
Magarshack: It’s typically 65nm, moving into 40nm now. Over time we have perfected the ability to integrate two die together in a package, minimizing things like power and crosstalk. The overall system cost is not the only benefit at the end. In terms of program management and schedule, you concentrate your teams on the digital and making the analog IP work. We see this as a benefit in time to market. We also can swap the analog out without having to wait for the digital die. So while we don’t see a strong market case for 3D stacking, at least for the next two or three years, we do have the capability in-house, as well as with our foundry partners, to make this happen. We already have prototypes. The first products we see will be DRAM with wide I/O.

LPE: What’s the perceived benefit of system-in-package with wide I/O? Is it better power management, better utilization of cores, performance, or cost?
Magarshack: The cost is not a driver. At this point it will be neutral, at best. The number one benefit is the memory bandwidth. When you have intensively used CPU cores or graphics engines, they need to access 1,000 or 2,000 bits of memory. This is something enabled by wide I/O. And in terms of total system power, for the same quantity of data being transferred from the DRAM to the chip, the wide I/O drivers are much smaller. The distance is smaller so the overall power of the system is lower. That’s the other big advantage. One technical change that has not been addressed is that, as a consequence of this intense activity on the DRAM and the processing engine, you have a temperature elevation. Removing the heat is one of the big issues. We are working to simulate the heat effects and to have ways of dealing with it.

LPE: What’s new, though? We’ve had memory on the chip, on the board, and now we’re placing it somewhere in the middle.
Magarshack: We may be able to remove some levels of cache. That improves the system response in case of interrupts. The size of the DRAMs also is increasing at the level of Moore’s Law. We have not been able to take advantage of all these bits and DRAM except through the I/Os, so for me wide I/O is a potential architectural breakthrough. Within two or three years this will be on the table and it will be at the forefront of SoC technology.

LPE: What’s the big hurdle going forward? Is it the hardware or the software?
Magarshack: We have come a long way in optimizing pieces of the hardware and the software. There is still more that can and should be done in terms of validating the software before the hardware comes out. We also have come a long way in virtual prototyping where we can boot the operating system and debug the device drivers before we develop SystemC models for the IP, or a combination of hardware emulation of the other IPs. Once we get silicon, we can plug it in the board and boot the OS within hours, and have significant applications running within days. After that, applications can run in real time before we find problems that need to be fixed. But we can take advantage of the silicon as soon as it comes out and go into production soon afterward. We still have to wait between 6 and 12 months for software bring-up and testing of software in the customer’s environment. To me that has to be shortened even further.

LPE: Will 2.5D allow some of the software to be re-used, as well?
Magarshack: If you have an adequate digital chip and you need to add another interface, you may only have to touch one device driver.

LPE: Will this be available from other vendors or will it still be ST developing the software, hardware and IP in-house?
Magarshack: We do see this as a differentiation for us. We have applications ranging from GPS to set-top boxes and WiFi, and we are able to bring in the hardware as well as the software part. There are millions of lines of code developed for systems on chip. We do most of that internally, but we are looking for and finding external partners, as well, for things like device drivers. We also are looking at what open source can bring us. This is easing the burden of software development. But the integration goal is a differentiation, and we will not hand off this function to an outside partner.

LPE: Will you have a standardized way of building and packaging these kinds of chips?
Magarshack: Yes, and we are one of the few that have the breadth of applications from consumer to auto to wireless. We are taking the approach that business units within ST are exchanging IP, both hardware and software, among each other. We have IP in our GPS group and they are moving IP into wireless. WiFi expertise is moving toward automotive. We definitely are doing these exchanges. For this exchange to work, we are working on internal IP standards so you can plug it in and re-use it. We have the SPEAr (MPU) family, where we have processor cores and GPS or other modulator IP, encoders and decoders. This can use an undifferentiated block that can be configured by the customer. They can put in their own proprietary IP or they can ask us to provide it.

LPE: So to some extent you’re breaking your products down as platforms and subsystems, as well as lots of other IP?
Magarshack: Yes, and this is not just the hardware IP. It’s hardware and software.

LPE: Is it a combination of cost and speed to market that’s driving this?
Magarshack: Time to market is the most important.

LPE: What happens to your ecosystem? Does it grow or shrink, and does it get tighter?
Magarshack: We tend to have deeper and stronger relationships with fewer partners. We have moved to become part of the Common Platform alliance. We work with IBM, as well as the foundries of silicon, in addition to our internal fabs. We have developed intense relationships with these companies. We also need to have a strong relationship with the back-end assembly partners. Each part of the supply chain has to be extremely reliable and committed and focused. We need tight delivery dates across the entire food chain.

LPE: Any changes in materials that will be used?
Magarshack: Right now we are moving forward with a differentiated approach. We are building on top of the bulk CMOS. At 28nm we are moving forward with fully depleted SOI. We believe this will bring a very strong advantage for us of higher performance at the same voltage or lower power. We are now moving to take advantage of this with our partners.

Power and Performance in Architectural Migration

Thursday, July 21st, 2011

By John Blyler
It’s no secret that today’s market favors electronic products that use less power while providing ever-greater feature sets at higher levels of performance. These conflicting requirements have caused many embedded hardware and software developers to consider competing processor architectures for their next design iterations.

Architecture migrations are tricky because they involve taking software designed to run on one computer hardware and porting that software to execute on a totally different system. Migrating software to run on a new processing platform can be risky and time consuming. Many factors must be considered, such as the choice of an operating system (OS). Common concerns center on the best way to optimize both power and performance during the migration. Among other issues are the available tools for debugging in the new environment.

These questions and many others are addressed in a new book by Lori M. Matassa and Max Domeika, which was published by Intel Press. Titled “Break Away with Intel ATOM: A Guide to Architecture Migration,” the book obviously focuses on migration strategies from competing platforms to Intel’s embedded ATOM processors. Still, an interview with one of the authors reveals a variety of useful development tips that apply to architectural migration in general.

LPE: What motivated you to write a book on migration strategies?
Domeika: Lori and I saw a need to help embedded software developers and engineers migrate to the Intel Atom. Over the last several years, our customers have asked many questions about the details of things that they need to know to be successful in the migration. Lori and I wanted to document all of these questions and answers in one place to benefit a broad spectrum of people—from managers considering migration to the engineers that have to do the work.

LPE: Which competing processor platforms are covered in the book? Also, will the migration be to a single-core Atom or the new double-core version?
Domeika: We primarily cover migration strategies from the two big architectures of PowerPC and ARM. Many customers have experience on these embedded architectures, but now want to explore a move over to the Atom processor. Some developers want to know the low-level architecture details, such as the special features of x86 assembly language. Other details we cover include the pros and cons of an in-order processor instead of an out-of-order processor like our other, bigger processors.

Many questions center on the issue of porting existing software to a new platform. One common issue that we see from customers deals with byte order. How do you migrate from a big-endian processor architecture like PowerPC to a little-endian one like the x86? The multicore question is a challenging one. Once the software is migrated to the Atom processor, it’s easier to take advantage of new multicore platforms. In general, embedded developers are still learning the advantages of multicore systems. One of the challenges is that no one roadmap exists for customers in the multicore space. We still have customers who are struggling with the same multicore issues that we were talking about two to three years ago.
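The byte-order issue raised above can be sketched in a few lines. This is only an illustration, not material from the book: it uses Python's standard struct module, and the sample value and helper name are made up for the example. PowerPC targets are typically run big-endian, while x86 is little-endian.

```python
import struct

value = 0x12345678  # arbitrary sample value

# The same 32-bit integer laid out in memory in the two byte orders.
big = struct.pack(">I", value)     # big-endian, typical of PowerPC targets
little = struct.pack("<I", value)  # little-endian, as on x86

print(big.hex())     # 12345678
print(little.hex())  # 78563412


def read_u32_be(buf):
    """Decode a 32-bit big-endian field regardless of the host byte order."""
    return struct.unpack(">I", buf)[0]


# Code that assumes host order silently misreads data written by the other side.
wrong = struct.unpack("<I", big)[0]
print(hex(wrong))              # 0x78563412
print(hex(read_u32_be(big)))   # 0x12345678
```

Stating the byte order explicitly at every file, network, or register boundary is one common way to keep ported code independent of the host architecture.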

LPE: The cover of the book has pictures of tablets, nettops, and smartphones. Do these different development targets have different migration strategies? Or are the differences minimal—confined to hardware-specific issues like display screen resolution and memory?
Domeika: Every migration has both common and unique aspects, which made writing the book a hard task. You don’t want to be so specific that you have things that only apply to one person. On the other hand, you don’t want to be so general that it applies to nobody. Lori and I have done our best to try to generalize and discuss the key topic areas. As I mentioned, some folks want to know about the low-level details. But many don’t need the low-level details. These developers don’t need to know the details of assembly language or the Atom architecture, especially if they’re application developers coding in a higher-level language like C++.

OS issues are common to most customers. Some use commercial-off-the-shelf (COTS) OSes that make certain tasks easier but other tasks harder. Other folks are bound by a proprietary OS that they need to port. Proprietary OSes bring in other system-level and assembly-language issues in terms of device drivers. So it really depends. We try to be general enough to suit the needs of many folks, but provide enough detail that it’s of some value.

LPE: Let’s talk about available migration tools.
Domeika: Historically, Intel’s tool focus has been on best performance (i.e., trying to get the most optimized performance). Our compiler engineers sit closely with the Atom architects, so we’re able to design compilers that know the Atom internals and create very fast code. Similarly, our profiling tools are tuned to watch for events that have more or less impact on the processor. One of the big embedded tool areas is power optimization. How do you optimize the processor for power? There are tools available now and some coming out later. One of the currently available, common open-source tools is called PowerTop.

There were many demonstrations at the last Embedded Systems Conference (ESC) that relied on external electrical devices to measure power. These devices had external probes that would monitor the power on a chip or board. PowerTop is different. It’s a software tool that monitors the idle states of a processor—specifically, the C and P states—while the software is running. Idle processors use less power. PowerTop monitors the processor as it is transitioning between its C and P states. Knowing the transition timing allows the designer to figure out what part of the software is causing the processor to wake up. Too many interrupts may increase the system’s power consumption. The software developer can use this information to determine if all of those interrupts are actually needed. Perhaps fewer processor interrupts can be used. Sometimes, the solution is silly things, like insistent polling of the processor by an application. One solution may be changing the polling behavior or even moving to a different processing architecture.

LPE: How about chip power-management systems that are based on real-time operating-system (RTOS) software control (e.g., turning off specific sections of the chip as needed)?
Domeika: Those low-power techniques are certainly useful. However, my focus has been on the software-development side. Many developers don’t want to go to a deep level of detail. This has been an eye opener for me—a realization that has caused me to think in a new way.

While there are ways to micromanage the chip’s power usage, it’s usually more efficient to simply let the chip manage the power at that level of detail. A great many power decisions are controlled by the processor. We’ve found that application developers don’t have a big desire to manually tell the processor which sleep state to enter or when to wake up. Their interest is at a higher level, such as deciding how often to interrupt the processor.

This is analogous to threading issues in multicore processing. Multicore threading is considered the assembly language of multicore programming. Here too, questions arise as to whether it’s better to have libraries that address most levels of power and performance issues, so developers can focus exclusively on their software applications. Not surprisingly, mainstream developers want things to be easier.

LPE: Are there any third-party tools that can be used for multicore design?
Domeika: The book also covers some third-party tools. One such tool is called Prism by CriticalBlue. This tool supports multicore programming on embedded processors by allowing users to play “what-if” performance scenarios. For example, what if you were able to make a section of the code run in parallel? How much faster would the code run across four cores? What are some of the potential issues that you’d have to worry about if you’re going to make something run in parallel? Common issues include the use of shared variables and parallelism, concurrency concerns, and ensuring that the code runs correctly.
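A back-of-the-envelope version of the "what-if" question above can be worked out with Amdahl's law. This is not how Prism works internally; it is a generic estimate, and the parallel fractions below are assumed values chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the run time can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed fractions of run time that could be made parallel.
for fraction in (0.5, 0.9):
    speedup = amdahl_speedup(fraction, cores=4)
    print(f"{fraction:.0%} parallel on 4 cores: {speedup:.2f}x")
```

The estimate ignores synchronization and shared-variable costs, so real gains are usually lower than this bound.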

The Tao Of Software

Thursday, June 16th, 2011

By Ed Sperling and Pallab Chatterjee
As software teams continue to race past hardware teams in numbers of engineers, hours spent on designs and NRE budgets, companies are beginning to question whether there needs to be a fundamental shift in priorities and strategy.

The problem is that it takes far too long to write and debug the software and to get it working on the hardware, even with virtual prototyping capabilities.

“Bare metal software is the hard part of the problem,” said John Bruggeman, Cadence’s chief marketing officer. “It’s the bane of the embedded system company—80% of the time is spent getting bare metal software to run on hardware. It takes two to three months to get Linux to boot because there is no visibility into the software and the hardware simultaneously.”

That challenge becomes increasingly more difficult at each new process node, as well, because complexity is increasing on both sides. Bruggeman said there are three reasons solutions haven’t worked so far. One is that every solution to date has been closed or proprietary, which limits the number of programmers working on a solution. The second is that solutions today are fragmented, both by multiple vendor tools as well as some of the flows by single vendors. And third, the complex multi-geographic development coupled with enormous scale and size has not resulted in a coherent solution.

Cadence clearly isn’t alone in recognizing the growing problem in software, although it is the most vocal of the Big Three EDA vendors. All have major software efforts under way and have made significant investments in these areas. Mentor Graphics has a big push in Embedded Software and Synopsys has an equivalent focus on software prototyping. All have made acquisitions in their respective areas, as well.

But getting software to run more efficiently on the hardware is a different sort of problem. It’s understanding how the two interact at a very deep level.

Glenn Perry, general manager of Mentor’s Embedded Systems Division, recounts a story of one customer that was porting Linux to a chip and couldn’t figure out why the operating system was continually burning up energy. The culprit, as it turned out, was a blinking cursor.

“The goal is to put power in front of software,” said Perry. “When we do that with a regular optimization of Linux we see a 70% to 90% improvement in power. We need to fix the simple stuff first, and this isn’t so easy. What we’ve found is that embedded developers know very little about software.”

Power games
But if hardware engineers know little about software, the reverse is also true. One of the biggest demands for improving the efficiency of software comes from the gaming world, where software typically has been written in a high-level language with little or no attention to power consumption. In gaming, the user focus always has been on performance—both in speed and in resolution—rather than power. But as more games are being downloaded onto mobile devices, that perception has changed dramatically. No matter how good the game, if it drains the battery in 20 minutes no one will buy it.

The result is that power controls need to be specified in the code, which is difficult considering the growing demands on these systems. Most online gaming is done at 720p resolution due to bandwidth limitations, with a typical compression of 1 I frame for every 200 P frames as part of the H.264 codec.

Mobile platforms typically code in OpenGL while 3D games use OpenCL. These games use a shader, 3D render, and main graphics display engine for the iPhone, iPad, Samsung phones and tablets, LG phones, Motorola and Droid phones, Asus tablets and the Motorola Xoom. Several mobile gaming companies (France, Italy, Finland, Sweden) are now developing products for Q4 release using OpenCL for the Imagination Technologies PowerVR core.

The challenges are growing from there, as well. Several major software companies, to provide a higher quality visual experience, also have written a new codec for use with the Xbox360 and PS3 platforms. These new codecs handle a different raster and render routine that supports both physics-based graphics generation (fire, rain, water, snow, wind, explosions, and striking reactions from swords/sticks/knives) and secondary scan for background details (flowers on trees, multi-color grass, flowers and moss on the ground, details on reeds, etc.) in addition to the normal patterns. The new codec was needed to be able to send and render the data in the standard data stream size.

Which comes first
So how much is all of this really going to affect design? Despite predictions that software engineering teams would displace hardware teams, the reality is that both will be forced to co-exist. They will never actually speak the same language or work on the same exact project, but the push is to improve communication back and forth between them. Software needs to become far more power-aware, and hardware needs to become more efficient at running software.

The last time the design world dealt with an issue like this was when the battle over RISC vs. CISC—reduced instruction set computing vs. complex instruction set computing—was being waged. That was in the 1990s, when Unix first posed a commercial challenge to operating systems from companies such as IBM, Hewlett-Packard, Digital Equipment Corp. and a handful of others that made their own OSes back then.

But power is forcing these issues back on the table once again, driven initially by the mobile sector and increasingly by devices with a plug. The likelihood is that it will never be a perfect marriage, but it is one that is likely to last this time because both teams need to at least have the same goal—even if they don’t talk the same language.

Power Bits: Feb. 18

Friday, February 18th, 2011

Ignoring The Rules
In a classic example of how technology gets used in ways for which it wasn’t designed, the University of Massachusetts at Amherst has been experimenting with running embedded flash memory at voltages lower than what is recommended for the microcontroller.

Using software algorithms the team at UMass’ Department of Computer Science has developed what it claims are reliable storage methods at low voltages without modifying the hardware. This is an interesting development, but it also raises lots of questions about how IP will ultimately be used.

The researchers presented a paper on the subject at the Usenix conference in San Jose, Calif., this week, and said the energy consumed was 34% lower using this method. The question for companies evaluating this approach is what effect it has on performance and security, and what the tradeoffs are in terms of area and cost.

Unifying Power Intent
Si2 has released version 2.0 of the Common Power Format in an effort to bridge the gap between CPF and the Unified Power Format (UPF). Just for reference, Cadence developed CPF while Mentor Graphics and Synopsys support UPF. Both try to define the power intent of a design, but a lack of interoperability has created problems, particularly at the verification stage for fabless companies that rely on third-party IP and specs.

Smarter Windows
Philips Research has developed an “e-Skin” panel that switches from black to transparent using scavenged energy from a mobile phone’s RF signals. Aside from just being interesting, it’s particularly useful for smart windows in an office building, which can be dimmed when the sun is bright and clear when it is not.

Reconfigurable radios
Imec has developed low-power spectrum sensors for cognitive radios and networks. This is the kind of technology that will mean fewer dropped calls, no matter where a phone—or more accurately, a communications device—is used.

–Ed Sperling

How Software Utilizes Cores

Thursday, November 4th, 2010

By Ann Steffora Mutschler
When writing software, how does the design engineer determine how much power it will draw on a particular targeted platform? While the question seems straightforward, the answer is not.

The industry is just starting to develop the ability to get some data in that space, according to Cary Chin, director of technical marketing for Synopsys’ low-power solutions group. “And when we can do that, then I think what you’ll find is mobile applications will actually be written differently than the ones you run on a laptop because they’ll be better optimized for power and may do things differently in terms of how you cache data.”

Getting to that point isn’t simple, though. Jason Parker, operating systems architect at ARM, said power-efficient software needs to be part of the design from the start. “Designers need to constantly ask themselves, ‘Is this the most power saving way of solving this problem?’ Trying to retrofit power management and efficiency into an existing design is hard work, and all the silver bullets were used up a long time ago. Multiprocessor designs open up additional techniques and constraints for power management.”

Understanding what happens below the surface is a start. Threads and processes are the software abstractions that represent CPU execution and the visible memory space. A thread represents the execution state of the CPU, e.g. program counter, registers and flags. A process is the constrained memory space, enforced by the MMU, within which one or more threads execute, he explained. There can often be more than one thread in a process, and they all share the same data.

In a single-core processor, the OS kernel scheduler shares the CPU among the threads. Execution is managed by scheduling threads according to thread priority and time slicing, and switching between threads is known as a context switch. In comparison, a multiprocessor (MP) combines multiple high-efficiency CPUs that together can deliver greater aggregate performance for less total power than a single high-performance CPU, and provides more power management options, Parker noted.

MP systems are divided into symmetric and asymmetric systems. “Asymmetric systems can have different OSes running on different cores working together to provide the whole system solution. An example would be a smart phone that has an ARM Cortex-A8 application processor for the Android user interface, and a different Cortex-R4 processor running the real-time telephone stack in the RF modem, and additional cores for graphics, video and low-power audio. The advantage of these systems is the processors and resources for each subsystem can be tailored to deliver the expected performance at minimal power. The disadvantage is the system architecture is often fixed and may not be able to implement a future requirement, e.g. new video format.”

Meanwhile, symmetric systems run a single OS kernel across identical cores with a coherent memory system joining them together, Parker explained. “SMP OSes will run multiple threads simultaneously, aiming to share the workload over the cores within the cluster. Well-structured code and algorithms, that are parallelizable, are able to harness the performance of the multiple cores. Existing code and serial algorithms may not be able to take advantage of multiple cores. Power management systems within SMP OSes will control power consumption by scaling performance on the cores using DVFS, and will turn off unused/underused cores.”

Today’s complex SoCs contain a mixture of SMP and AMP subsystems, with power optimized for their respective tasks. For example, “a multicore Cortex-A9 system provides the flexibility for an open-platform OS where the future application requirements are not known, whereas the CPU requirements for an LTE modem are known at design time,” he said.

Attaining optimal core utilization
But just understanding how the system is structured is not enough. To achieve the best utilization of cores by the software certain techniques should be implemented, keeping in mind that core utilization is driven by the subsystem partitioning and the further parallelizability of system code and algorithms. “The OS scheduler can maximize execution efficacy by keeping threads and their data on the same or local CPUs while application software can force this by the use of thread affinity,” Parker said.
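As a small illustration of the thread affinity Parker mentions, the sketch below pins the calling process to a single CPU. It assumes a Linux host, where Python exposes os.sched_getaffinity and os.sched_setaffinity; on other platforms the helper (whose name is invented for this example) simply returns None.

```python
import os


def pin_to_one_cpu():
    """Pin the calling process to one CPU; return the new CPU set, or None if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity calls are not available on this platform
    allowed = os.sched_getaffinity(0)  # CPUs the process may currently run on
    try:
        os.sched_setaffinity(0, {min(allowed)})  # restrict to the lowest-numbered one
        return os.sched_getaffinity(0)
    except OSError:
        return None  # the environment refused the request


mask = pin_to_one_cpu()
print("pinned to:", mask)
```

Forcing affinity overrides the scheduler's own placement decisions, so it is usually worth measuring before and after rather than assuming a gain.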

Maximizing core utilization will drive maximum performance. However, it may not be the most power-efficient solution for every silicon process, particularly when power management can optimize thread scheduling because the total required software load is only a fraction of total performance. For example, in a dual-core system where the total load is 80% of one CPU, key questions to ask are:

1. Does the kernel run one CPU at 100% performance, with the second one turned off?
2. Does the kernel run both CPUs at 50% performance, with lower frequency, voltage and total power?
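The two options above can be compared with the usual first-order model of dynamic power, P = C x V^2 x f. The sketch below is only a toy calculation: the normalized voltages (1.0 at full frequency, 0.8 at half frequency) are assumptions, the work is assumed to split evenly across cores, and leakage and idle power are ignored.

```python
def dynamic_power(n_cpus, freq, voltage, utilization, capacitance=1.0):
    """First-order dynamic power, C * V^2 * f, scaled by busy time and CPU count."""
    return n_cpus * utilization * capacitance * voltage ** 2 * freq


# Option 1: one CPU at full frequency and voltage, the second CPU turned off.
one_fast = dynamic_power(n_cpus=1, freq=1.0, voltage=1.0, utilization=0.8)

# Option 2: both CPUs at half frequency, at an assumed lower voltage of 0.8.
two_slow = dynamic_power(n_cpus=2, freq=0.5, voltage=0.8, utilization=0.8)

print(f"one CPU at 100%: {one_fast:.3f} (normalized)")
print(f"two CPUs at 50%: {two_slow:.3f} (normalized)")
```

Under these assumptions the two slower cores draw less dynamic power, because voltage enters the model squared. On a leaky process the extra leakage of keeping a second core powered can reverse the result, which is why the answer depends on the silicon.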

In addition to subsystem partitioning there are other ways to optimize how software utilizes cores, but it depends on the tasks at hand, Parker said, including consolidation of multiple OSes onto a single CPU or cluster using a hypervisor. Also, many instances of a virtualized OS can be distributed over many cores using virtualization, such as in the case of Web servers. At the other end of the scale, embarrassingly parallel problems can be handed over to a GPU, using OpenCL for example in image processing.

“In the middle is where things are interesting,” he said. “How does an existing system scale across many cores? This is a 30-year-old challenging problem for performance, and more recently the power cost. Using threads is a workable solution for existing code and a few cores (less than eight), but they are hard to program. Measurement and analysis, as ever, are the engineering skills required. Without a very good understanding of your system it will be hard to make good use of multi-core hardware.”
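One common way to reason about how existing code scales across cores is Amdahl's law, which the article does not name but which fits Parker's point about serial code. The sketch below assumes a fixed fraction of the work can be parallelized perfectly and ignores threading overhead. The function name and sample fractions are illustrative.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when `parallel_fraction` of the work scales across `cores`.

    Amdahl's law: 1 / ((1 - p) + p / n). Synchronization and memory
    contention are ignored, so real results will be lower.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(amdahl_speedup(0.5, n), 2), round(amdahl_speedup(0.9, n), 2))
```

Under this model, code that is half serial can never run more than twice as fast no matter how many cores are added, which is why measuring the parallel fraction comes before choosing a core count.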

When to use multicore
Everything is headed in the direction of multiple cores today, said Synopsys’ Chin. “As the frequencies on processors are continuing to be pushed up, that pushes technology further and further and makes the power problem worse and worse. The idea of trying to increase throughput or increase processing capability by duplicating cores to either dual-core, quad-core, hex-core or many more in some processing units has been the path that most of the processor manufacturers have been on. People have been talking about that for the last 8 or 10 years.”

“As a result, we see lots of processors—Intel Core i5, Core i7 kinds of processors with four and six cores pretty mainstream today and very interesting, although the architecture in mobile electronics hasn’t really gone that route yet. I’d say it’s more the idea of heterogeneous cores where you are using specific cores for more specific tasks. In a mobile application there is even more demand for optimizing the processor capabilities to the specific task at hand,” he noted.

Some applications do better in multicore environments than others, however. “The big difference between the kind of performance improvement you’re going to see with regard to a server farm versus a mobile device is that on a server farm the applications like virtualization, databases, and Google searches are algorithmically well parallelized and can be threaded easily. When you’re in a cloud or server farm environment you also have the benefit of having many, many users which provides another level of parallelization and capability with the overall farm,” Chin said.

In those environments, it makes sense to parallelize and have as many cores as possible because the whole idea of starting up the farm is to raise utilization. “The idea is to have your farm running at close to 100% utilization if you can, 24/7, whether that’s with online finance applications or Christmas ordering seasonally. And you want that to be balanced with usage from other parts of the world,” he continued. “With a mobile application there’s only a certain amount of threading you can do in the OS and in the applications that you want to run. On something like a smart phone the idea isn’t to have it running all the time. In fact, the idea is the opposite. You want it running as little as possible.”
