Posts Tagged ‘multicore’
Power Optimization Drives Embedded And Multicore Software
Thursday, December 10th, 2009
By John Blyler
Max Domeika, senior software engineer in the Developer Products Division at Intel, sat down with LPE Consulting Editor John Blyler to talk about the growing importance – and intersection – of both the multicore and embedded markets. What follows are excerpts of that conversation.
LPE: Intel’s software focus seems to be following its hardware processor drive into both multicore and embedded markets. What challenges does that bring for traditional software developers?
Max Domeika: Coming from a background in the desktop software application space at Intel, I’m now spending more time working in the embedded multicore arena. This year I’ve been particularly focused on power issues, primarily on the Atom processor. My task is to see what sorts of tools are needed to help developers move to both embedded and multicore applications. Already I see a long-term need for both power optimization and power measurement tools. The key is to monitor the power-related impact of your application on both specific and overall system performance.
In the past, desktop clients and server users haven’t had to pay much attention to power. Desktop systems plug into the wall and their software applications use as much power as the processor wants to give them. Over the past several years these processors have incorporated features that help control the amount of power usage in both C (idle) states and P (operational) states.
How do these processor states affect the development of software applications?
In the past, processors either ran at full speed or sat idle. Several years ago hardware designers added features to the chips to control how deep a sleep the processors are in. As you know, these features allow different portions of the chip and caches to be turned off. One of the challenges is that while deep sleep saves more power, it often takes more power to wake up.
For the software developer, this means that you don’t want the application to enter the deepest C-state if you will have to wake up immediately. There needs to be some smarts as to how deep a sleep you go into, which is really an operating system issue. P states, or operational states, utilize varying frequency and voltage to balance the amount of the execution that the OS determines is needed. These states directly affect the performance of the system.
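The trade-off described above, where deeper sleep saves power but costs more to wake from, can be sketched as a small selection routine. This is only an illustration of the idea, not how any particular operating system chooses idle states. The state names echo common C-state labels, but every power, energy and latency figure below is a made-up assumption, as are the names `IdleState`, `idle_energy_uj` and `pick_idle_state`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    power_mw: float         # power drawn while resident in the state
    wake_energy_uj: float   # one-time energy cost of entering and leaving
    wake_latency_us: float  # time needed to resume execution


# Hypothetical figures, ordered from shallowest to deepest sleep.
STATES = [
    IdleState("C1", power_mw=500.0, wake_energy_uj=1.0, wake_latency_us=2.0),
    IdleState("C3", power_mw=200.0, wake_energy_uj=60.0, wake_latency_us=50.0),
    IdleState("C6", power_mw=50.0, wake_energy_uj=400.0, wake_latency_us=200.0),
]


def idle_energy_uj(state, idle_us):
    """Total energy for one idle period: residency cost plus wake-up cost."""
    # mW * us gives nanojoules; divide by 1000 to get microjoules.
    return state.power_mw * idle_us / 1000.0 + state.wake_energy_uj


def pick_idle_state(expected_idle_us, max_latency_us):
    """Pick the cheapest state that can wake in time for this idle period."""
    candidates = [
        s for s in STATES
        if s.wake_latency_us <= max_latency_us
        and s.wake_latency_us <= expected_idle_us
    ]
    if not candidates:
        return STATES[0]  # fall back to the shallowest state
    return min(candidates, key=lambda s: idle_energy_uj(s, expected_idle_us))
```

With these assumed numbers, a 100-microsecond idle period favors the shallow state because the wake-up cost of deeper states outweighs the savings, while a 10-millisecond period favors the deepest one.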
These states can have a big impact on the application, restricting how developers write the code. If your application causes the processor to perform poorly, that will have a negative effect on power utilization. Developers need more mature tools to help figure out which application processes result in effective use of the C-states.
I see a need for continued maturity of the power optimization and power measurement tools and methodologies. These power tools must also be tied to traditional performance analysis and optimization tools, because many of the techniques for mitigating power are the same techniques that you use for traditional performance optimization. The entire system must run as quickly as possible while using as little power as possible.
At the other end of the spectrum are the same issues of power optimization and performance analysis, but applied to a multicore environment. Tools and methodologies need to mature to include multicore development, too.
Are the two worlds of embedded and multicore coming together? After all, Intel’s Atom isn’t yet a multicore architecture, is it?
Well, some instances of the processor already support hyper-threading, a technology that dates back to the Pentium 4 processor. The key here is hyper-threading, which makes the environment look like two (or more) processors from the point of view of the operating system. That’s why the software techniques that developers use on multicore are starting to have an impact on embedded applications targeting the Atom processor.
Isn’t the low power push also affecting the high-end embedded and multicore processors like the Xeon?
Power is important across the board. We’ve seen power optimization become important in servers – especially with the “greening” of data centers.
Let’s talk about the actual tool environment for both power issues and multicore design. What’s happening there?
Today, I’d summarize the tool environment as consisting of a collection of separate tools and techniques. For example, if you want to do power optimization then you might use PowerTOP – an open source tool for doing power measurement. Conversely, if you want to do performance analysis – to count cache misses, branch mispredictions, memory fetches and the like – using performance monitoring counters, you might use another open source tool called OProfile. Intel also has a tool called the VTune Performance Analyzer.
These tools show what performance issues are occurring on the chip, which in turn helps the developers to optimize their code. For example, if you see examples of high cache miss rates, you can investigate to see what portions of the code are causing this problem. This might mean that the data structure of the application needs to be changed to get better cache performance. Performance and power tools give the developer a means of getting valuable feedback from the hardware.
Most desktop application developers are well versed in Microsoft’s Visual Studio IDE. What tools are available for these developers as they move toward multicore applications?
Intel has the Intel Parallel Studio, which integrates well with Microsoft’s Visual Studio for multicore (parallel) code development. Parallel Studio is not targeted at embedded folks, but rather at the desktop client environment. Intel has tools that also help with the programming interface, compiler, libraries and more. With regard to debugging, we have a set of enhancements that integrate into the MS Visual Studio to help with parallel debug.
Debugging is a key issue. While developers could someday spawn 64 threads on a multicore chip – because they have 64 cores – that is not the best way to begin. In multithreaded implementations it’s best to start with one thread and make sure the program works, then move up to two, four and so on. Good debugging tools provide easy mechanisms to start debugging with one thread and then scale to more cores, i.e., they are serially consistent.
Another challenge in multicore development is in the area of configuration control. You may have multiple threads running on multiple cores, but you don’t want multiple versions of the same code. Instead, you want one version of the code with a parameter that you can adjust as the number of processor cores changes. Again, good debugging tools have those configuration control features.
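As a rough sketch of that approach, the Python example below keeps a single version of the code and exposes the thread count as a parameter, so the program can be verified with one worker first and then scaled up. The function names `checksum` and `process` and the sample workload are invented for illustration; they do not come from any Intel tool.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(chunk):
    """Stand-in for real per-chunk work (hypothetical workload)."""
    return sum(chunk) % 65521


def process(chunks, workers=1):
    """Same code path for any worker count; only the parameter changes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order regardless of thread timing.
        return list(pool.map(checksum, chunks))


chunks = [range(i, i + 1000) for i in range(0, 8000, 1000)]

# Step 1: make sure the program works with a single thread.
baseline = process(chunks, workers=1)

# Step 2: scale to two and four threads and confirm identical results.
for n in (2, 4):
    assert process(chunks, workers=n) == baseline
```

The point here is the structure rather than the speedup: in CPython, threads will not accelerate CPU-bound work like this, but the same pattern applies to process pools or native threads.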
Tools work best when they are following a set methodology. You’re co-chair with David Stewart from CriticalBlue on the Multicore Programming Practices (MPP) Working Group – part of the Multicore Association for which Markus Levy is CEO. Please give the readers a quick update of the MPP.
As you know, our focus is on documenting best-known methods for multicore software development techniques. This year we are in the middle of documenting the best practices. Internal review of this document should begin shortly. We hope to have it ready for external review by the first half of next year.
This document will fulfill a need that is oftentimes overlooked, namely, what are the best practices using the technology that is available today. We have customers that are becoming more and more aware of the challenges of multicore software development. But we are still building awareness and educating the larger group of mainstream programmers. Even when mainstream developers identify the need for a multicore program, they are often stuck with their existing code. Not everyone has the resources, time or need to completely rewrite their legacy code. The MPP document will provide mainstream programmers with a workable set of best practices for multicore development throughout the typical life cycle development process: analysis, design-implementation, debug and the performance tune-up phase.
That’s the big vision. We just have to keep executing. This is all volunteer work, so it’s not something where I can say, ‘Hey, let’s meet this schedule in two weeks. You have to do it.’ Instead, we just have to keep the momentum going. I am pretty pleased with the progress. The feedback from our internal surveys is positive.
The goal of your best practices working group is to use existing languages like C/C++ to develop multicore applications. Do you see the need to create new programming languages?
The Multicore Association has other working groups that are developing standards for new software approaches, such as those focused on multicore communications and runtime APIs. The overall plan is to incorporate best-known techniques using those APIs as we move forward. But it’s hard to predetermine the best-known methods before the APIs are available. We won’t know until we get there, but we can’t wait for one to proceed before starting the other.
Intel is working on several technologies, both language extensions and new APIs, so yes, there is a need for technology; I’m not so sure about new programming languages. In embedded, C and C++ are going to be with us for some time, so I’d say there’s probably less need in embedded.
Hypervisors For Managing Power
Thursday, November 12th, 2009
By Ed Sperling
Hypervisors are headed for a new role inside of multicore chips—managing the various power islands in addition to the cores.
A patent application filed by IBM, entitled “Method and system for hypervisor based power management,” shows the company’s intention to use hypervisors for everything from monitoring power consumption rates to scaling power for individual cores. http://www.faqs.org/patents/app/20080301473
In the well-documented history of hypervisors, this marks a major shift in direction. Hypervisors have been used primarily for running virtual machines on a single or multiple cores and for directing applications to take advantage of one or more cores. In effect, they have worked like rudimentary traffic cops, scheduling software functions for processors, memory, logic and buses.
Adding power into the mix changes the basic concept in two fundamental ways. First, it means the operating system becomes less important in a multicore system because critical decisions about what gets turned on and off, how much power is assigned to different processors or other parts of the chip, and what gets prioritized are made by the hypervisor layer rather than the operating system. And second, it means getting chips out the door will become immensely more complicated because just thinking about all the possible permutations for verifying these kinds of systems makes your brain hurt.
“A hypervisor for low power management certainly can work,” said Marc Bryan, product marketing manager for Mentor Graphics’ Codelink products. “This is an extension into the SoC world and configurable IP. Software developers want middleware capability to control the power demands with the SoC.”
He noted this works in both multicore and single core chips and becomes particularly useful in chips with multiple configurable power domains, such as an advanced ARM processor that can contain 14 of those domains. But it’s also like building complexity on complexity.
“This definitely opens up a new set of challenges in design and verification,” he said. “You’re adding complexity in the power domain. The challenge is verifying it. You have to make sure the hardware switches on and off and that the software is included. And with power management software, you have to make sure you can turn on and off the power domain and that the software works correctly with the hardware.”
This is no simple feat. In fact, to the best of anyone’s knowledge, it has never even been attempted.
From the beginning
The concept of a hypervisor has been around for decades. IBM introduced the first implementation back in the 1970s with its System/370 mainframes as a way of virtualizing applications running on the mainframes to make them more efficient.
Fast forward to 2005 and that same technology showed up in the eight-core Cell processor, which IBM created with Sony and Toshiba. Sony used seven of those cores for its PlayStation 3, plus a hypervisor to manage all the cores. It was the classic example of smaller, faster and cheaper compared to the complex multi-million dollar mainframes that were the size of multiple refrigerators.
Almost simultaneously, the same general concept began showing up to manage virtual machines in virtualization software created by companies like VMware and Citrix, which allow multiple operating systems to run on a single core or multiple cores. They also allowed multicore servers to be utilized at greater rates than the average 15% to 20% that many were being run at, costing both power to run the machines and power to cool the server racks.
Using hypervisors to manage the power itself, however, is new and shows the resilience of this concept of adding programmable controls for functions that typically have been handled by hardware.
“The hypervisor is a way to really start giving us control over power in SoCs,” said EDA consultant Gary Smith. “Put that together with an NoC and you really start moving toward an ESL view of power.”
Market realities
It still could take years before this concept shows up in power management of SoCs, however. While there is a compelling need to simplify power management on chips, this may not be the only approach or even the best approach.
Right now, many of these functions are assigned to the operating system. It’s possible that the operating system can start offering these kinds of capabilities rather than a hypervisor, or that a more robust hypervisor will be built into operating systems.
But hypervisors, at least in IBM’s view of the world, have a distinct advantage. In IBM’s model, the hypervisor runs between the metal and the operating system, almost like an enabling set of middleware. The result is that it can take advantage of whatever changes are made to the hardware and whatever hooks are added much more quickly than those changes can be added into the operating system, where backward compatibility of applications is vital. (See Figure 1)

Figure 1: IBM's hypervisor design
“One problem with doing power in the hypervisor is in the area of security,” said Barry Pangrie, solutions architect for low power design and verification at Mentor. “If you’re creating a medical device and you put more into the hypervisor, that means the hypervisor layer now has to be certified.”
The flip side is that the hooks in the hardware are going to be much more readily available to a hypervisor layer built for a specific purpose than an operating system. “When you’re talking about dynamic voltage frequency scaling, for example, those capabilities tend to run well ahead of what the software guys are using when they write their code. One way to deal with that is to make the OS smarter and use some of the statistics dynamically to help bring down the power levels.”
Another way is to develop new code that wedges between the hardware and the operating system, which is one of the models now being considered in the virtualization world. But when this gets to market and in what form is unknown. What’s interesting is there is a need and a method, and from here anything can happen.
Designing Systems For Power And Throughput
Friday, September 25th, 2009
By Ed Sperling
Most of the energy consumed inside processors is no longer used for computation. It goes to things that most chip designers think about after the design is completed, such as communication inside and outside the chip, managing those communications and the power levels across the chip.
Research from Intel Labs, unveiled at the Intel Developer Forum this week, shows that for a supercomputer to achieve performance of 1 teraflop (one trillion floating-point operations per second), it now takes 200 watts for communication, 150 watts for memory to feed it, 100 watts for the computation, 100 watts for the external disk, 1,500 watts for control, 950 watts for the power supply and 2,000 watts for heat removal.
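To put those figures in proportion, the short calculation below adds up the quoted wattages and works out each component's share of the total. The numbers are the ones cited above. The dictionary layout and the `share` helper are just one illustrative way to do the arithmetic.

```python
# Wattages quoted from the Intel Labs research for a 1-teraflop system.
budget_watts = {
    "communication": 200,
    "memory": 150,
    "computation": 100,
    "external disk": 100,
    "control": 1500,
    "power supply": 950,
    "heat removal": 2000,
}

total_watts = sum(budget_watts.values())


def share(component):
    """Fraction of the total power budget used by one component."""
    return budget_watts[component] / total_watts


for name in budget_watts:
    print(f"{name:>14}: {budget_watts[name]:5d} W  ({share(name):.0%})")
print(f"{'total':>14}: {total_watts:5d} W")
```

The total comes to 5,000 watts, of which computation accounts for 2 percent and heat removal for 40 percent.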
These may seem like enormous numbers compared to what are used in even communication base stations, and they’re orders of magnitude higher than many consumer devices. But the ratios are relevant even for consumer devices (minus the heat removal, in most cases), said Nash Palaniswamy, senior manager for throughput computing in Intel’s Data Center Group.
“The commercial world is all about balance,” said Palaniswamy. “You get the maximum you can from multiple cores. If you look back 15 years ago, algorithms could not work across cores, so it made communication impossible. Now we’re able to take advantage of multiple cores.”
At least that’s true in the supercomputing space. In the consumer world, many applications cannot be threaded or parallelized beyond a certain point. Intel has been focusing on a concept called balanced computing, which means that all the pieces in the computer function at the same rate so there are no bottlenecks. For example, it doesn’t pay to put in an advanced component just because it’s available if the rest of the device won’t run any faster or better.
John Gustafson, a fellow in Intel Labs, said the new focus is on communication across systems. “It’s painful,” he said. “The cost per use is in the communication, not the wires.”
What’s particularly interesting is this is the way the human body works, Gustafson said. The majority of energy in the brain is spent on communication, not on processing.
“Things like larger cache allow the design to save power because it’s better to stay on chip than go off chip,” he said. “Right now, we’re spending about 10% of the power on communication and 90% on computation. In the future, we’ll be spending 90% on communication and 10% on computation. For all intents and purposes, floating point is now free.”
Writing Software For Low-Power Systems
Wednesday, April 15th, 2009
By Ed Sperling
Almost any discussion of software in low power systems these days involves some sort of multicore approach.
That is particularly true at 90nm and below. At 65nm, unless there is a very distinct purpose for a low-power single-core device, it probably is utilizing at least two cores, and at 45nm the numbers can continue to rise, depending upon how many functions the chip is being used for and how important processing power will be.
For developers used to working in the symmetric or asymmetric multiprocessing world, where single-core processors are arranged in arrays within the same device and tied together by middleware and very fast connectors, moving everything inside a chip actually makes low-power design simpler. In the SMP or AMP world, it was impossible to turn processors on and off. That’s already standard practice in multicore chips, which is a more controlled environment for running software than the multiprocessing world.
But designing software for multicore devices requires a lot more up-front planning than back-end work-arounds to really save power.
First of all, it’s important to note up front that not all applications can be parallelized to take advantage of multicore, and of those that can very few can be compiled once and scale to more cores as they become available. It’s a great concept, and multicore chip companies like Intel and IBM say great progress is being made, but there’s a whole other group that will counter with, “Don’t count on it.”
Moreover, multiprocessing was optional for applications. At 65nm and below, multicore chips are the norm. If software can’t utilize more than one core, the other cores are useless.
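The scaling limit described above is commonly estimated with Amdahl's law, which the article does not name but which captures the same point: the serial portion of an application caps the benefit of adding cores. The function below is a minimal sketch of that formula. The name `amdahl_speedup` and the example fractions are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup on `cores` cores when only part of the work is parallel.

    Amdahl's law: speedup = 1 / ((1 - p) + p / n), ignoring any
    synchronization or communication overhead.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Example: an application that is 90% parallelizable (assumed figure).
for n in (1, 2, 4, 8, 64):
    print(f"{n:3d} cores -> {amdahl_speedup(0.9, n):.2f}x")
```

Under this model, an application that is 90 percent parallelizable can never run more than 10 times faster, no matter how many cores are added, and one with no parallel portion gains nothing from the extra cores.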
Second, multicore can mean many things. In a system on chip, it typically involves heterogeneous cores. In a processor, the cores are generally homogeneous. Writing software that takes advantage of many cores requires a multiprocessing operating system and applications that can be run in parallel. In an SoC, the software can be divided up by function using everything from a multiprocessing operating system like Linux to real-time operating systems that are written for a very specific function.
“The real trick is that if you break up an application, you have to do it at the modeling level,” says Irv Badr, Rational senior product marketing manager at IBM. “Breaking it at the source-code level is very difficult. If you break it at the modeling level, it’s as simple as pushing a button. You want the ability to move things around by asking ‘What if?’ That is very important. You also need to make sure when you’re modeling that the software isn’t coupled to the hardware. Some hardware can be used by a lot of software.”
More problems, more tools
A number of tools have been created to help migrate existing software to multicore architectures. The most recent is Prism, which is made by Critical Blue. It allows developers to analyze and explore code changes to take advantage of multicore hardware doing everything from dependency analysis to recalculation of the scheduler on multiple cores.
“The software guys didn’t ask for multicore,” said David Stewart, Critical Blue’s CEO. “But the only way we’re going to get more performance is if the software guys react.”
Intel, meanwhile, has created its own programming language to migrate applications to multicore architectures. Known as Ct, the language helps to parallelize applications that can run in parallel. The key to working in this type of environment is understanding the application well enough to know what can be split off and run on multiple cores and what cannot—and how much overhead there is in pulling the pieces back together for the user.
Ct isn’t the first language to attempt to ease the burden of moving from sequential to parallel software development. Software engineers who have been working in the multiprocessing world for a while say it probably won’t be the last, either.
In Europe, a consortium known as eMuCo, for the embedded Multi-Core Processing for Mobile Communication, is taking a different approach by developing a standard platform for future mobile devices based on multicore architectures. The stated goal is to develop the controller, operating system and application layers. Members include ARM, Infineon, Telelogic, GWT-TUD, as well as four universities.
Promises, promises
If all of this can be made to work, there is enormous upside from both a performance and a low-power perspective. In devices such as a smart phone, for example, cores regularly are put into sleep mode. That can extend the battery life from hours to days, and in some cases even weeks.
Already, work is underway that teams up some unlikely partners. ARM’s Cortex controller is being combined with IBM’s Cell processor, for example, in a 60-core deployment on multiple chips, said IBM’s Badr. He said that in the enterprise, multicore can reduce power consumption by a factor of three, which allows blade servers to run three times as long because they run cooler.
But there’s a catch, too. While there’s money attached to making it work right this time, the problem has been studied for decades without major breakthroughs. The jury is still out on just how many cores is enough and how many is too much, and which software will work in what configuration. But given the realities of physics on a piece of silicon, there will be at least some multicore headaches in every programmer’s future.

