Posts Tagged ‘multicore’

Low Power Drives New Architectures

Thursday, September 8th, 2011

By Pallab Chatterjee
Power became the driving discussion at several major events last month.

The global calls for energy reduction, which have been mainstream at the political level since the early 1970s, have now become economic realities for component and systems suppliers. Chipmakers are finding that lower power makes good economic sense: lower packaging cost, lower cost of ownership, higher reliability and, most importantly, differentiation. Distinct power-reduction methods are lowering the cost of sales because they increase customer retention.

Once a methodology is selected for the chips, it is carried through to the board, then the system and eventually the software that runs on it. This makes changing the power method very expensive and typically keeps the customer on multiple generations of hardware and components from the same suppliers under the same software umbrella.

The Hot Chips conference featured several dramatic network and multicore server products that all had enhanced power management. Power management formerly meant multiple rails (I/Os and cores) and sometimes a thermal shutdown. The new schemes are pervasive to the point that architectures are created with equal attention paid to power management and data throughput. The features shown were multiple power supplies, variable supply voltages, block-based shutdown and turn-on, new circuits to minimize turn-on/turn-off, alternate clock tree distribution systems, lower power PLLs and clocks, and even new logic methods.

Fulcrum presented a 1 billion packet/second frame processor, which ended up being a case study for the applicability of non-synthesized sequential logic or asynchronous design. The logic structure, while known in the past, has never been implemented in such a large-scale application before, and the results included not only better performance but a power envelope that was task-acceptable.

Similarly, IBM, Intel, Tilera and Cavium presented next-generation many-core designs with performance targeted at application needs over the next 5 to 10 years, but with power profiles at levels similar to chips of many decades back. The general rule is that power per transistor in these designs is less than 100 times what it was five years ago.

On the system side, data centers are the driver. Dell addressed the issue of power reduction for its servers by not just swapping components, but also re-qualifying the systems to work at extended temperature ranges. This means peak air temperature can be as high as 113 degrees Fahrenheit (45C) for its servers without sacrificing performance or warranty. This increase from 80 degrees Fahrenheit means there is no need to provide chilled air to cool the machines. The cost of cooling the air is generally equal to or greater than the cost of the energy to run the servers.

To keep the component power down, these servers use new 30nm DDR3 DRAM from Samsung, which now operates at 1.35V, down from 1.8V. The lower supply voltage and the smaller process geometry provide higher performance, higher density and an overall reduced power envelope. Google has noticed that by using virtualized machines and large amounts of DRAM on its servers it can eliminate the power from rotating media and go to mostly high-memory machines. This architecture systematically drops power at the data center level by double-digit percentages and provides an increase in performance. The performance increase allows for the implementation of new features such as “instant search” while a user is still typing into the search field.
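
As a rough illustration of why the supply-voltage drop matters, the sketch below applies the standard CMOS switching-power approximation, P ≈ C·V²·f, to the 1.8V and 1.35V figures quoted above. It assumes capacitance and frequency stay constant, which real DRAM parts will not match exactly, and the function name is only illustrative.

```python
def dynamic_power_ratio(v_new: float, v_old: float) -> float:
    """Ratio of switching power at v_new versus v_old.

    Assumes P ~ C * V^2 * f with capacitance C and frequency f held
    constant (a simplifying assumption, not a measured result).
    """
    if v_new <= 0 or v_old <= 0:
        raise ValueError("voltages must be positive")
    return (v_new / v_old) ** 2


if __name__ == "__main__":
    ratio = dynamic_power_ratio(1.35, 1.8)
    print(f"Switching power relative to 1.8V: {ratio:.4f}")
```

Under those assumptions the ratio is (0.75)², or about 0.56, which would put the switching-power saving from the voltage change alone at roughly 44%. Leakage, I/O and refresh power are not covered by this approximation.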

Facebook, which is new to the game on hardware, took a fresh look at power and started not with the chips, the memory or even the board, but with “how is the power getting to the computers?” It was able to provide a 12% to 15% reduction in power by redesigning the power supply input (408V to 24V signal path) and eliminating the UPS in its servers. This is a new area of high-power and high-current design that companies need to consider. Facebook also ended up changing the board designs for the base compute server modules. Information on the Facebook approach and other areas to address the power can be found at OpenCompute.

Power as defined by the EDA community, which is “dynamic peak power in active mode,” as well as power in idle mode, multi-mode and transition, and even infrastructure, will all play a key role in next-generation low-power design.

Experts At The Table: Multi-Core And Many-Core

Thursday, August 11th, 2011

By Ed Sperling
Low-Power Engineering sat down with Naveed Sherwani, CEO of Open-Silicon; Amit Rohatgi, principal mobile architect at MIPS; Grant Martin, chief scientist at Tensilica; Bill Neifert, CTO at Carbon Design Systems; and Kevin McDermott, director of market development for ARM’s System Design Division. What follows are excerpts of that conversation.

LPE: How does cloud computing change the need for multicore and many-core processors?
Sherwani: Cloud architectures will evolve differently from mobile architectures. They will be homogeneous 8-, 16- and 32-core architectures. They know a lot about what you are storing. You can put a lot of intelligence into what you’re storing, which is not the case in a mobile device.

LPE: So what does that mean for the mobile devices taking advantage of it?
Sherwani: It can certainly make mobile devices more efficient. You can store a lot more on the mobile devices. You can do a lot of streaming.
Martin: The application cloud interaction may change in character. People will write somewhat different apps in the future that will take advantage of what the cloud has to offer. This is why you’ll see cobwebs on the desktop in the future because no one is very interested in it anymore.
Sherwani: And if you look at video, with the cloud and a good wireless connection you don’t have to store the video. Video cameras will become a lot less expensive.
McDermott: This should be put into context. It’s amazing that people are so excited about a database. That’s all it is. I believe the vision for the mobile device is that you have access to all the data, and you selectively choose how to expose it. The browsing experience is different. You don’t try to replicate the desktop experience on a smaller screen. It’s a given. You take the appropriate content and you display it in a way that’s easiest to digest. I think the hardware on the mobile device will become smart enough to selectively show you the piece that you need on your mobile device. You don’t need an entire map. You just need to know where you are.

LPE: What’s interesting about databases, though, is that they’re one of the very few applications that really can do true parallel processing and scale effectively.
Sherwani: I’ve been saying for the last two years that we should stop giving people content. In five years all the content will be available. If you’re a mechanical engineer, everything you need will be on the Web. What we need to do, though, is teach people how to do something useful. This is the same thing with mobile devices. Whatever device will be useful will be the one that can quickly filter through what you’re looking for to get something done. It’s not about storing more information. Cloud brings that opportunity to people, devices and things. Our view of expertise will change. It won’t matter if you’re an electrical engineer. It’s whether you can get a task or series of tasks done. That will be more important than a Ph.D. We are 10 years from that, but this is how people of the next generation will think.

LPE: What you’re talking about is data mining for the masses?
Sherwani: Yes.
Martin: Before we get too carried away, there are a couple of issues that really need to be solved in this cloud paradigm. We do need to think a lot about privacy, security, and the ability of the infrastructure—both wired and wireless—to deliver all of this content off the cloud and onto the sea of mobile devices. We all know about the experiences of certain smart phones overloading networks and they’re still trying to improve the quality of the network. The wired infrastructure is not fault free. Security and privacy worry me more. If you upload all your data into some big infrastructure, you want your data secured.
Rohatgi: That’s the weakest link. Everybody’s pushing down this path. What worries me is the security and reliability. There are a ton of issues that need to be resolved. Creating a smart infrastructure for data mining can be done today. On the mobile side, there are probably some advances necessary to improve battery life, which is the No. 1 complaint I hear today. But the weakest links we hit are the communications channel, security, privacy and reliability. If those can be resolved then we can progress.
Martin: The technologies we’re all involved with are going to help in a big way. It just requires a bit of mobilization to focus on those issues.
McDermott: This reminds me of where we were with cell phones years ago when the processor went through certification with the carrier. The consumer doesn’t see all the certification on the network. The carrier loves new features. It’s more traffic for their store. It brings in a new wave of users. What they don’t want to see is something that disrupts their infrastructure. For the engineer, the certification is really intense and the field trials are difficult. The cell phone industry has to show a partition that you can certify your baseband and your protocol stack and that has to be isolated from other activity. That underlying security infrastructure is built into the certification. I think we’ll see that extended upward through commercial transactions to having trusted processes and transactions.

LPE: Will cores all be homogeneous or heterogeneous, and will some of them be virtualized?
Sherwani: All of the above. There will be homogeneous cores, heterogeneous cores and there will be virtualization. They all solve different problems. You need virtualization in data centers.

LPE: But will you need virtualization on your smart phone?
Rohatgi: We’re starting to see some of that. I don’t think the operating system wars are dead. And at the end of the day, there is some value to keeping RTOS access to legacy hardware and a high-level operating system like Android or Windows or iOS. From a security angle, it all depends on the use case. The mobile guys are really scared of virtualization of a single processor that has access to all memory. They want separate memory and separate everything.

LPE: This is similar to devices that have a partition between what’s used at home and at the office, right?
Rohatgi: Yes. It’s the same problem. And this almost ties into virtualization. On the privacy side, there isn’t a well-defined security layer with NFC (Near Field Communication) and they’re talking about mobile payments. If you power on an Android phone and shut off all networking then your maps go haywire. Why? Because there’s a back channel that goes to some cloud that helps triangulate where you are. That information is stored to help applications of the future. I’m surprised people aren’t bothered by this. But to return to the question, we’re starting to see some effort down the path of virtualization even though it’s not widespread yet.
Martin: You won’t see virtualization down to the metal. In the dataplane layers it’s nice that processors can emulate other processors effectively, but close to the metal you want extreme efficiency and high performance.
Neifert: And that’s where I see the problem with virtualization. It’s the power. Virtualization is nice, but it’s an abstraction away, which is a power loss. At that point you need heterogeneous processing.
Rohatgi: Transmeta, about nine years ago when they started doing abstractions to hardware, had power numbers that were way down. It’s too bad that green energy wasn’t something that was important then. Still, the genesis of the Atom processor was entirely because of Transmeta.
Sherwani: A typical Bluetooth radio takes about 32 milliwatts of active power. At 65nm we have a Bluetooth radio that only uses 3.2 milliwatts. And there is a design on the board that will take it below 1 milliwatt. There are a bunch of engineers getting excited because over the last 100 years the basic design of a radio has not changed. What Marconi designed is essentially the same as we have today. But when you scale down, the power needs to go down. It’s amazing how much lower you can go.
Rohatgi: There’s the other side of this, too. Battery technology has not evolved as much as we would like. For the analog components, it’s the switching characteristics that are governing it. That’s where you’re seeing a lot more intelligence. If you were to look at the power profiles of a mobile device, LEDs and LCDs were supposed to be the promise for low power. That hasn’t worked out. There are still 250 milliwatt drivers. The radio is probably No. 2 on the list after that.
McDermott: People’s expectations were that a screen would be a certain pixel density. Today that needs to be super high-definition. It’s beyond high-def.

LPE: So will we see more cores in the future or have we maxed out?
McDermott: As a programmer, how are you going to keep track of 100 cores? How are you going to program that intelligently? Either it’s going to be some array a programmer can visualize, or it’s going to be three or four very solid cores and let other cores do things like Bluetooth. You can’t keep 100 threads in your mind.
Rohatgi: There’s a limit to this. If you look at the desktop space, in 2006 when Intel began heading out on this multicore approach they found that success wasn’t nearly as fast as they thought. There’s probably a limit on mobile devices, too.
Sherwani: We did all this in the 1980s. nCube used to have a 16-core and 32-core machine. It works great up to 8 cores, but after that you lose it.
Martin: If you are trying to program a concurrent application and split it into different threads, there are inherent limits. Some very specialized applications may be very concurrent, but most are not.
Neifert: The programming model has a human in the center, and humans can only process so much. Until the fundamental programming model changes, you won’t see much advancement.
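
The inherent limit Martin describes for splitting an application into threads is commonly modeled with Amdahl's law, which the panelists do not name. The sketch below is only an illustration of that law; the 90% parallel fraction is a hypothetical figure, not one taken from the discussion.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` cores when only `parallel_fraction`
    of the work can run concurrently (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 90% of the run time parallelizes perfectly.
    for cores in (1, 2, 4, 8, 16, 32, 100):
        print(f"{cores:>3} cores -> {amdahl_speedup(0.9, cores):.2f}x")
```

With a 10% serial portion the speedup can never reach 10x, no matter how many cores are added, which is one way to read the diminishing returns the panel describes beyond a handful of cores.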

Experts At The Table: Multi-Core And Many-Core

Friday, July 29th, 2011

By Ed Sperling
Low-Power Engineering sat down with Naveed Sherwani, CEO of Open-Silicon; Amit Rohatgi, principal mobile architect at MIPS; Grant Martin, chief scientist at Tensilica; Bill Neifert, CTO at Carbon Design Systems; and Kevin McDermott, director of market development for ARM’s System Design Division. What follows are excerpts of that conversation.

LPE: Is software taking advantage of the hardware in a power-efficient way?
Rohatgi: Yes, and the ultimate example of that is the Android operating system. Even though it relies on Linux, there are on-demand and five levels built into Linux that control, at the software level, the CPU registers or SoC registers to shut down power. You’re already seeing that at the operating-system level.
Martin: It depends upon which software you’re talking about. At the OS level, where lots of apps are running, there may be commoditization happening. Down at the dataplane, where people use application-specific processors, you can argue that’s the infrastructure. People want extreme power efficiency and reliable continuously executing functionality. That’s the place where heterogeneous multiple processors really shine. It’s almost an infrastructure layer in a mobile device. So you see different solutions depending on what level of the device you’re talking about. We see a drive to more heterogeneity, too. Baseband wireless infrastructure works better with heterogeneous processors than trying to shove that onto a multicore device.
Neifert: That’s certainly what we’re seeing in our customer base. They want one processor to run the modem subsystem or the WiFi and partition that off. The last thing you want to do is wake the application processor all the time. The application processors are getting more complex so you can talk and play games at the same time and surf the Web. The application processor has to handle all of that. The application processor may be power efficient, but not as power efficient as one that just runs the radio or data transfer.
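
The “on-demand” behavior Rohatgi mentions may refer to the Linux cpufreq `ondemand` governor; that reading is an assumption, since the transcript does not say so. Under that assumption, the sketch below shows how software can inspect the active frequency-scaling policy through the cpufreq sysfs files `scaling_governor` and `scaling_available_governors`. The function names and the `sysfs_root` parameter are illustrative, and the files are only present on Linux systems with cpufreq enabled.

```python
from pathlib import Path
from typing import List, Optional

# Standard Linux location; passed as a parameter so it can be overridden.
DEFAULT_SYSFS_ROOT = "/sys/devices/system/cpu"


def read_governor(cpu: int = 0, sysfs_root: str = DEFAULT_SYSFS_ROOT) -> Optional[str]:
    """Return the active cpufreq governor for one CPU, or None if unavailable."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq" / "scaling_governor"
    try:
        return path.read_text().strip()
    except OSError:
        # Not Linux, cpufreq disabled, or the CPU does not exist.
        return None


def parse_available_governors(text: str) -> List[str]:
    """Split the space-separated contents of scaling_available_governors."""
    return text.split()


if __name__ == "__main__":
    governor = read_governor(0)
    print(governor if governor is not None else "cpufreq information not available")
```

Reading the files is harmless; changing the governor requires root privileges and is deliberately left out of this sketch.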

LPE: Is it better to actually design a device with multiple processors or a single multicore processor?
Sherwani: When I was at Intel we believed it was the best processor ever developed. I never thought I would see ARM and x86 processors on the same device. We are not that far away right now—and I’m talking about having them on a single chip. Or it may be a MIPS or Tensilica core. Such processors will exist. We are very efficient these days about using power islands. We can put six or eight processors on a chip and we can put them to sleep when they’re not being used.

LPE: Is it more difficult to verify them?
Sherwani: The verification nightmare is growing exponentially, and it’s not clear to me how we will be doing verification five years from now. At the implementation level, verification is becoming a bigger and bigger piece. But it’s more of an architecture question than whether you’re using multicore or many cores.
Martin: This whole approach tends to lead to a more compositional design style where you’re composing well-understood systems. What you need to do is limit the interactions between them to a relatively high level of abstraction or control. You verify significantly each subsystem and then you verify without having a great deal of interaction between the subsystems.
Sherwani: It’s amazing that on a big chip people don’t do flop-to-flop timing on a block. This is a situation that would never happen in software between subroutines, but it happens all the time in hardware. In hardware we have not reached a maturity level where I take care of my block and you take care of your block. We have timing paths going to two blocks and you cannot time it unless you do the timing and verification together.
Neifert: I’ve got customers that will spend months validating their processor, fabric, memory and data path, throwing out all the various options on there and running that. That could be a single-core processor reaching out to memory, and they’ll spend a lot of time optimizing that. Now throw in one other master accessing the same memory and everything goes out the window because of all the different permutations when these things talk to each other. It now blows up exponentially. The nice thing about a multicore approach is that you’ve handed off a lot of that task to the processor guys and hope that they’ve done it properly. It may not be the optimal use for your application, but pushing the problem off to an IP provider and a multicore solution is what a lot of our customers are doing.

LPE: What’s the best way to take advantage of cores? Do you do it with Wide I/O or through multicore and a standard bus?
Sherwani: If you look at where Micron is going with this, the whole interface has been changed. The memory becomes a lot more intelligent instead of dumb storage. You will be able to ask memory to do certain tasks. Processor people have tried to make memory as dumb as possible in order to commoditize it. All the value comes from the processor side. But balancing would be better so you can offload things. You can combine flash into the most cost-effective memory. Instead of saying, ‘Give me byte No. 7,’ you can say, ‘I need this piece of information.’ It’s a lot more power-efficient to do it that way.
McDermott: It’s quality of service. You’re not just making a data request. You’re saying, ‘I need high bandwidth or high efficiency or low latency.’ A processor may need only a small amount of data, but it may need it very efficiently and very fast. With video you need high bandwidth that is very predictable. Having graphics integrated is one way to go. Unless you have a view of the fabric, the quality of service and the end power engine it’s going to be very hard to engineer a one-point solution.
Martin: With a compositional approach, you may have big memories and then a lot of small distributed memories to keep data close to the area where it is being processed. And maybe you need some intelligent abstractions on things like DMA (direct memory access). That would give programmers more assistance in managing the data flow and data interaction so things will move out of central memory into local memory before they’re needed. That’s a different programming style. We need more flexibility in how hardware and software developers can compose these memory systems together.
Sherwani: If memory is knowledgeable about what is stored inside, it can give you service of the highest level. Right now you can’t do that. The attitude has been, ‘I have a board and I have a DIMM and I want this DIMM to be as low cost as possible.’ That approach has led us down this path. If you’re designing a microprocessor of any kind, it puts a lot of burden on the microprocessor to do all these things with memory. Eventually you will see memory microprocessors—storage with a processor on it—that can gate what is being stored on it. That is a new area, though, and I don’t think much has been done so far.
Rohatgi: In some respects this is already happening. If you think about cache controllers over the last 30 years, this is where you’ve seen a massive improvement. It isn’t user-level aware. It’s bit-level aware. And if your memory isn’t fragmented it works. Or in a multicore design, a coherency module is also very well aware of what it needs to do to keep synchronization between processors. I like the visionary statement of making it user-focused.
Neifert: If you look at the various SoCs on the market, they may use processors from ARM, MIPS and Tensilica, but a large number of them are still doing their own memory controllers because that’s a place to differentiate their design. There are more memory controllers coming out of Synopsys and Cadence, but in large part the bleeding-edge SoCs are still designing their own.
Sherwani: But you can go a lot further.
McDermott: There’s a big difference if you can optimize a path for video and have some pre-fetch algorithm. That may not apply to every chip. But in a custom design, you can partition as needed. When you define your coherency space you need to make them aware of these choices. It’s not just an arbitrary memory spec. You need to make them aware of how to use it.
Martin: That should lead to some opportunities for much more sophisticated memory control, and the kinds of data flows and accesses that people really want to do. That can be reflected in configurable memory IP. I’m not sure how rapidly that’s happening, but there are moves in that direction.
Sherwani: For the work we are doing with the [Micron] Hybrid Memory Cube, there’s a lot of excitement around that space. A completely different level of system design is possible with that kind of hybrid model.

Experts At The Table: Multi-Core And Many-Core

Thursday, July 21st, 2011

By Ed Sperling
Low-Power Engineering sat down with Naveed Sherwani, CEO of Open-Silicon; Amit Rohatgi, principal mobile architect at MIPS; Grant Martin, chief scientist at Tensilica; Bill Neifert, CTO at Carbon Design Systems; and Kevin McDermott, director of market development for ARM’s System Design Division. What follows are excerpts of that conversation.

LPE: Computers aren’t getting the power/performance boost today from multiple cores because the software can’t take advantage of them. How do we fix that?
Martin: Your computer isn’t a place where all the advanced design techniques are used. You have to look at battery-powered, cordless devices to find the places where people use the most advanced design techniques. There they very often will have specialized application processors for different parts of the applications they want to run on those devices. Those processors are designed to be energy-efficient and to efficiently use battery power, and they probably do work better from one generation to the next, except for the case where they may throw on additional general purpose processors and don’t take advantage of energy consumption. You have to draw a big distinction between multiple processors that are application specific vs. general-purpose processors that do not offer efficiency or better performance.
Rohatgi: Once the Intel-AMD megahertz wars ended people started heading down a different dimension of multicore. Back then they believed that changing the software ecosystem so that specific software or systems could be written to take advantage of multi-core, multi-thread, multiple processor designs would actually work. We’ve seen it work in many cases. You can reduce the latency when you’re executing a certain process or multiple processes. Another twist to this paradigm is people use core islands. The operating system may run on one core while another core is used for acceleration. Some people define that as multi-core, and that has been very successful because you can partition between a media processor engine, a video processor engine and a graphics processor engine. In terms of power consumption, that whole element needs to be pieced into this picture. When it comes to embedded SoC design vs. desktop design, those are very different when it comes to power consumption. That element hasn’t been worked through very cleanly on the desktop side, where suddenly you need 800-watt power supplies.
Neifert: The overall user experience that people have when interacting with a device has moved from the underlying hardware to the software. The emphasis has shifted to enhance the user experience. Opening a window on your desktop used to be simple. Now there’s shading and fancy graphics, so the same window that used to come up in 5 instructions may now take 500. It looks a lot nicer and in some cases that changes the user experience. But from the processing side, the focus stopped being on single-thread performance as the megahertz started burning up too much power. They branched out into multicore to solve that, but changing the software to accommodate that has been a big struggle. Changing the hardware to isolate that properly has been a struggle, too. Some of the processing that has been done on computers is difficult to migrate over to mobile devices. A lot of the innovation on the desktop is now taking place in the embedded space. If you want to see the leading-edge design techniques, that is where you have to look.
McDermott: In the mobile area low power is associated with the battery life and the key to the user experience is maintaining functionality throughout a working day. We’ve gotten to that point. Now we’re engineering more productivity. There are more features you can run, more capabilities, more graphics, but still within that working day. Now what we’re seeing is low power is key to other markets. Data centers are predicted over the next few years to rival the airline industry for energy consumption. Cloud computing will lower the power at a node, but that energy is still being used somewhere even though it’s shifted. What cloud changes is that if you run an application on one device and shift to a different device it’s no big deal. It takes advantage of the underlying computing architecture. There also may be a hierarchy of operating systems to deal with it, depending on the device.
Sherwani: We got very interested in how power relates to multiprocessing. If you are trying to predict power within a watt or two that’s no big deal. If you are trying to predict power within a milliwatt, that’s very difficult. We thought that by looking at implementation of the netlist we could predict power. That turned out to be not the case. Then we tried system-level design. That doesn’t work. We finally came to the conclusion that you have to have a user model. We needed a human model—a businessman, a lawyer, a student—and then analyze what they did during the day. Then we had to convert that into system level and then RTL level. This takes us far from what Open-Silicon does as a company, but we have found this the only way to accurately predict power. These kinds of human models don’t exist. We created two models of two types of people who use it. Then we started recording real human beings and calculating the model against them. Good models don’t exist if you want to accurately predict power.

LPE: Are we better off with many cores or multiple processors?
Martin: Multiple heterogeneous processors are the way to go, particularly in the mobile domain. With clusters of servers you may have many homogeneous tasks you want to map. The desktop is a bit of the orphan here. If you move to cloud computing and the highly mobile devices and ever-smarter phones, you wonder if people will worry about even having a tethered desktop. That means the innovation may be in the big server farms and the mobile devices, and the desktop may gather dust.
Neifert: It will be replaced by a docking station that you plug your mobile device into.
Martin: That’s right. Or as we have seen, some companies are combining mobile devices and a laptop together. The use cases are extremely interesting because there is no single use case. For a mobile device that has an advanced graphics processor, the game player may burn up battery by hammering that all the time. The music lover may be using MP3 decoding and get significantly longer time out of the battery. That drives significantly different use models and processor choices.
Rohatgi: There are a lot of different vertical markets. It ranges from digital still cameras to anything with a battery. There is a use case for multiple processors. Networking and cloud computing are very large markets. In the embedded space, what has happened is there are a lot of people in the SoC space. The hardware itself is heavily commoditizing. Even the operating system is commoditizing. The differentiation is how you pick and choose your IP. If it comes down to cost in a mobile phone, from the top up they don’t have a feature list or a use model. The discussion begins with, ‘What can you fit in a 7 x 7?’ Based on something like that, what kind of IP can you fit in there and still have a useful device? In the volume mobile phone market, the direction is to shrink the die as small as possible. It may be a 6 x 6 or a 5 x 5. In that case, I would choose multicore rather than multiple processors.
McDermott: In cell phones the issue used to be standby and talk time. People could self control that. If you talk more your battery goes down. People are starting to experience that if you want to play games you have to deal with this. We’re starting to deal with the apps developers. You used to have specialized OSes and applications. With the proliferation of open source you don’t know what could be running on there. It can run any app. We’re reaching out to the app developer to write code that is attentive to the power effects. There is an amazing learning curve through people writing a good game experience in a power budget that’s acceptable. You need to get the apps to be power-efficient.

How Software Utilizes Cores

Thursday, November 4th, 2010

By Ann Steffora Mutschler
When writing software, how does the design engineer determine how much power it will draw on a particular targeted platform? While the question seems straightforward, the answer is not.

The industry is just starting to develop the ability to get some data in that space, according to Cary Chin, director of technical marketing for Synopsys’ low-power solutions group. “And when we can do that, then I think what you’ll find is mobile applications will actually be written differently than the ones you run on a laptop because they’ll be better optimized for power and may do things differently in terms of how you cache data.”

Getting to that point isn’t simple, though. Jason Parker, operating systems architect at ARM, said power-efficient software needs to be part of the design from the start. “Designers need to constantly ask themselves, ‘Is this the most power saving way of solving this problem?’ Trying to retrofit power management and efficiency into an existing design is hard work, and all the silver bullets were used up a long time ago. Multiprocessor designs open up additional techniques and constraints for power management.”

Understanding what happens below the surface is a start. Threads and processes are the software abstractions that represent CPU execution and the visible memory space. A thread represents the execution state of the CPU, e.g. program counter, registers and flags. A process is the constrained memory space within which one or more threads execute, with the MMU used to enforce it, he explained. There can often be more than one thread in a process, and they all share the same data.

In a single-core processor, the CPU is shared between the threads by the OS kernel scheduler. Execution is managed by the scheduling of threads, which is determined by thread priority and time slicing, and switching between threads is known as a context switch. In comparison, a multiprocessor (MP) combines multiple high-efficiency CPUs that together can deliver greater aggregate performance for less total power than a single high-performance CPU, and provide more power management options, Parker noted.

MP systems are divided into symmetric and asymmetric systems. “Asymmetric systems can have different OSes running on different cores working together to provide the whole system solution. An example would be a smart phone that has an ARM Cortex-A8 application processor for the Android user interface, and a different Cortex-R4 processor running the real-time telephone stack in the RF modem, and additional cores for graphics, video and low-power audio. The advantage of these systems is the processors and resources for each subsystem can be tailored to deliver the expected performance at minimal power. The disadvantage is the system architecture is often fixed and may not be able to implement a future requirement, e.g. new video format.”

Meanwhile, symmetric systems run a single OS kernel across identical cores with a coherent memory system joining them together, Parker explained. “SMP OSes will run multiple threads simultaneously, aiming to share the workload over the cores within the cluster. Well-structured code and algorithms, that are parallelizable, are able to harness the performance of the multiple cores. Existing code and serial algorithms may not be able to take advantage of multiple cores. Power management systems within SMP OSes will control power consumption by scaling performance on the cores using DVFS, and will turn off unused/underused cores.”
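Parker's point that serial algorithms may not benefit from extra cores can be illustrated with Amdahl's law. The article does not name this model; it is brought in here as a standard idealization, and the parallel fractions below are made-up inputs rather than measurements.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only part of the work can run in parallel.

    This is Amdahl's law, an idealized model: it ignores scheduling,
    memory contention and communication costs.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: half, most, and nearly all of the work parallelizable.
    for fraction in (0.5, 0.9, 0.99):
        row = ", ".join(
            f"{cores} cores: {amdahl_speedup(fraction, cores):.2f}x"
            for cores in (2, 4, 8)
        )
        print(f"{fraction:.0%} parallel -> {row}")
```

Under this model, code that is only half parallelizable can never run more than twice as fast, however many cores are added.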

Today’s complex SoCs contain a mixture of SMP and AMP subsystems, with power optimized for their respective tasks. For example, “a multicore Cortex-A9 system provides the flexibility for an open-platform OS where the future application requirements are not known, whereas the CPU requirements for an LTE modem are known at design time,” he said.

Attaining optimal core utilization
But just understanding how the system is structured is not enough. To achieve the best utilization of cores by the software certain techniques should be implemented, keeping in mind that core utilization is driven by the subsystem partitioning and the further parallelizability of system code and algorithms. “The OS scheduler can maximize execution efficacy by keeping threads and their data on the same or local CPUs while application software can force this by the use of thread affinity,” Parker said.
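As a rough illustration of the thread-affinity technique Parker mentions, the sketch below pins the calling process to a chosen set of CPUs. It assumes a platform where Python exposes `os.sched_setaffinity` (notably Linux). The helper name `pin_to_cpus` is hypothetical, and whether pinning helps at all depends on the workload and the scheduler.

```python
import os


def pin_to_cpus(cpus: set) -> set:
    """Restrict the calling process to the given CPUs and return the resulting mask.

    os.sched_setaffinity is only available on some platforms (notably Linux).
    Where it is missing, this returns an empty set and changes nothing.
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()
    allowed = os.sched_getaffinity(0)    # CPUs the process may currently run on
    target = set(cpus) & allowed         # never request a CPU we are not allowed to use
    if not target:
        return set(allowed)              # nothing usable requested: leave the mask alone
    os.sched_setaffinity(0, target)      # pid 0 means "the calling process"
    return set(os.sched_getaffinity(0))


if __name__ == "__main__":
    print("running on CPUs:", pin_to_cpus({0}))
```

Leaving placement to the OS scheduler is usually the default; explicit affinity is the application-level override described above.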

Maximizing core utilization will drive maximum performance. However, it may not be the most power-efficient solution for every silicon process, particularly where power management can optimize thread scheduling because the total software load is only a fraction of the available performance. For example, in a dual-core system where the total load is 80% of one CPU, key questions to ask are:

1. Does the kernel run one CPU at 100% performance, with the second one turned off?
2. Does the kernel run both CPUs at 50% performance, with lower frequency, voltage and total power?
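The two questions above can be compared with a back-of-the-envelope calculation using the dynamic power relation P = C·V²·f. This is a minimal sketch, not a power model: it ignores leakage and the cost of switching a core on or off, and the 0.8 voltage scale at half frequency is an invented figure, since real voltage/frequency pairs depend on the silicon process.

```python
def relative_dynamic_power(active_cores: int, freq_scale: float, volt_scale: float) -> float:
    """Dynamic power relative to one core at nominal voltage and frequency.

    Applies P = C * V^2 * f per core with C, V and f normalized to 1.0.
    Leakage power is deliberately left out of this sketch.
    """
    return active_cores * (volt_scale ** 2) * freq_scale


# Option 1: one CPU at 100% performance, the second one turned off.
one_core_full = relative_dynamic_power(active_cores=1, freq_scale=1.0, volt_scale=1.0)

# Option 2: both CPUs at 50% performance with reduced voltage.
# The 0.8 voltage scale is a hypothetical value chosen for illustration.
two_cores_half = relative_dynamic_power(active_cores=2, freq_scale=0.5, volt_scale=0.8)

if __name__ == "__main__":
    print(f"one core at full speed : {one_core_full:.2f}")
    print(f"two cores at half speed: {two_cores_half:.2f}")
```

With these assumed numbers the two-core option draws less dynamic power, but if leakage is high or the voltage cannot drop much at the lower frequency, the answer can flip. That is why the question depends on the silicon process.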

There are other ways besides subsystem partitioning to optimize how software utilizes cores, depending on the tasks at hand, Parker said. One is consolidating multiple OSes onto a single CPU or cluster using a hypervisor. Also, many instances of a virtualized OS can be distributed over many cores using virtualization, such as in the case of Web servers. At the other end of the scale, embarrassingly parallel problems can be handed over to a GPU, for example using OpenCL in image processing.

“In the middle is where things are interesting,” he said. “How does an existing system scale across many cores? This is a 30-year-old challenging problem for performance, and more recently the power cost. Using threads is a workable solution for existing code and a few cores (less than eight), but they are hard to program. Measurement and analysis, as ever, are the engineering skills required. Without a very good understanding of your system it will be hard to make good use of multi-core hardware.”

When to use multicore
Everything is headed in the direction of multiple cores today, said Synopsys’ Chin. “As the frequencies on processors are continuing to be pushed up, that pushes technology further and further and makes the power problem worse and worse. The idea of trying to increase throughput or increase processing capability by duplicating cores to either dual-core, quad-core, hex-core or many more in some processing units has been the path that most of the processor manufacturers have been on. People have been talking about that for the last 8 or 10 years.”

“As a result, we see lots of processors—Intel Core i5, Core i7 kinds of processors with four and six cores pretty mainstream today and very interesting, although the architecture in mobile electronics hasn’t really gone that route yet. I’d say it’s more the idea of heterogeneous cores where you are using specific cores for more specific tasks. In a mobile application there is even more demand for optimizing the processor capabilities to the specific task at hand,” he noted.

Some applications do better in multicore environments than others, however. “The big difference between the kind of performance improvement you’re going to see with regard to a server farm versus a mobile device is that on a server farm the applications like virtualization, databases, and Google searches are algorithmically well parallelized and can be threaded easily. When you’re in a cloud or server farm environment you also have the benefit of having many, many users which provides another level of parallelization and capability with the overall farm,” Chin said.

In those environments, it makes sense to parallelize and have as many cores as possible because the whole idea of starting up the farm is to raise utilization. “The idea is to have your farm running at close to 100% utilization if you can, 24/7, whether that’s with online finance applications or Christmas ordering seasonally. And you want that to be balanced with usage from other parts of the world,” he continued. “With a mobile application there’s only a certain amount of threading you can do in the OS and in the applications that you want to run. On something like a smart phone the idea isn’t to have it running all the time. In fact, the idea is the opposite. You want it running as little as possible.”

Performance Plus Lower Power

Thursday, October 7th, 2010

By Pallab Chatterjee
Power and performance often have been seen as something of a tradeoff. Chipmakers focus on one or the other, or they extract a little improvement in both at each new process node.

That way of thinking is changing, though. At the recent Linley processor conference, the central theme for both standalone and embedded processors was that architectures have to be optimized for power management and performance. Historically, performance and application code execution were the two lead design parameters. All of the processors shown now have as one of the primary design constraints a power management method and a design partitioning that supports selective-block power down.

One of the most anticipated presentations at this show came from Tilera, which presented its new architectural fabric for dramatically improved multicore processor designs. Its new technology features a bus and interconnect architecture for connecting tens to thousands of cores on a single die. This new processor family is optimized for power efficiency on a performance-per-watt metric. In its designs the number of cores is the new megahertz factor.

The power efficiency of Tilera’s design (up to 200Tbps on-chip with its 2-D mesh network) is based on using short wires and locally optimized CV²f. Designs for the Tile-Gx family exploit locally available, distributed L1 and L2 caches and distributed memories. In addition, the use of custom OSes allows for the localized power up/down of not only the processor cores, but also their unused local memory blocks. This method provides close to linear scaling of the processor and power consumption based on the workload sent to the device.

Applied Micro presented its PACKETProc processor family, which is a high-speed network processor simultaneously optimized for security, concurrency, availability, power management and deterministic behavior. To maintain security in all states the processor features distributed cores and localized state machine functions. For power management, this includes standby power modes, a controller for the recently ratified IEEE 802.3az-2010 Energy Efficient Ethernet standard, dynamic frequency scaling with individual control over each core in the design, and smart I/O that supports “wake on LAN” and low-power polling/support for WoX, USB and GPIO. This architecture is scalable from one to many embedded cores in a design on a single SoC.

Netronome presented a clarification on the new paradigm for application software and hardware processing of data traffic. As data traffic increases due to the prevalence of video and mobile data, peer-to-peer is no longer going to be the most voluminous data source. This large data-sized traffic (video is based on sustained packet flow, not single burst point-to-point data passing) is driving the server community from its base 10G infrastructure to 40G and 100G. These higher-bandwidth systems are based on “flows” rather than “packets.”

A flow is defined as a unidirectional sequence of packets all sharing a set of common packet header values. The common criteria are generally grouped as 2-tuple, 3-tuple, 5-tuple, 7-tuple and 10-tuple sets. The 10-tuple form is the base of the OpenFlow specification. These flow-based processors require a different power management methodology because the cache-flush cycles, and hence the power-down cycles, are different from those in packet-based processing.
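To make the flow definition concrete, here is a minimal sketch that groups packets into flows by the commonly used 5-tuple (source address, destination address, source port, destination port, protocol). The `Packet` type and the sample traffic are invented for illustration and are not taken from Netronome's presentation.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet header used only for this sketch."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    payload_len: int


def flow_key(pkt: Packet) -> tuple:
    """Return the 5-tuple shared by every packet of one unidirectional flow."""
    return (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.protocol)


def group_into_flows(packets) -> dict:
    """Map each 5-tuple to the list of packets that carry it, in arrival order."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)


packets = [
    Packet("10.0.0.1", "10.0.0.2", 5000, 80, "tcp", 1400),
    Packet("10.0.0.1", "10.0.0.2", 5000, 80, "tcp", 1400),
    Packet("10.0.0.2", "10.0.0.1", 80, 5000, "tcp", 60),    # reverse direction: a separate flow
    Packet("10.0.0.3", "10.0.0.2", 5001, 80, "tcp", 200),
]
flows = group_into_flows(packets)
```

Because a flow is unidirectional, the reply traffic in the third packet forms its own flow even though it belongs to the same connection.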

In a flow-based design, the data is not spatially dissociative between the general-purpose CPU and the cache memory. Instead, it is distributed over the group of cores and caches via a load balancer. This removes the memory latency issues and stalls due to cache and CPU misses associated with power-down of cores between packets. The flow processors are aware of the upcoming data strings due to the header commonality, and adjust the power management accordingly to minimize memory and data latency. This method also allows for multichip threading in addition to in-die multithreading.
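The article does not say how the load balancer assigns flows to cores. One common approach, shown here purely as an assumption, is to hash the flow's header tuple so that every packet of a flow lands on the same core and finds its state in that core's cache. The function name `core_for_flow` is hypothetical.

```python
import zlib


def core_for_flow(flow_key: tuple, num_cores: int) -> int:
    """Map a flow key to a core index; the same key always maps to the same core.

    Hash-based distribution is an assumed technique for this sketch, not a
    description of any vendor's load balancer.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    # crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
    digest = zlib.crc32(repr(flow_key).encode("utf-8"))
    return digest % num_cores


if __name__ == "__main__":
    key = ("10.0.0.1", "10.0.0.2", 5000, 80, "tcp")
    print("flow handled by core", core_for_flow(key, num_cores=8))
```

Keeping a flow on one core is what lets the power manager predict which cores and caches will be needed next, which is the property the paragraph above relies on.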

A New Reference For Low-Power Processors

Thursday, September 9th, 2010

By Pallab Chatterjee
Just how much power can you squeeze out of a processor without destroying performance?

Ask IBM. The company introduced a new methodology for power and energy management on its multicore processor chips. The new PowerPC chip, the Power7, has eight main processor cores, each with its own L2 and L3 cache, and two central memory controllers. The architecture for the design is built around an energy and power management schema called EnergyScale.

The EnergyScale system is a data-dependent, policy-based system that interprets activities in the processor cores, the memory hierarchy and the main memory. It is made up of four distinct parts: Sense, Decide, Control, and Actuate. The sense function is performed using both digital-thermal sensors (DTS) and critical-path monitors (CPM). The DTS uses 44 on-chip sense points, organized as five per chiplet plus points for emergency self-protect thermal throttling and on the main memory controllers. The CPM detects circuit timing margin to help guide the optimal frequency and voltage adjustments.

The decide block is an off-chip, dedicated-function microcontroller that gets its information on the status of the chip through an EnergyScale I2C Slave communication port. To assist in the performance of the EnergyScale microcontroller, the system minimizes the communications bandwidth by packing the sensor data to reduce the number of read operations, multicasting the responses to reduce the number of writes, and creating an automated on-chip transaction table which allows the sensor data to be streamed out in a single I2C command.

The control block features per-core frequency control ranging from -50% to +10% of the nominal frequency, on-chip support for off-chip voltage control, memory power management, and a command rate interface control. The core frequency control, in order to minimize latency, has an automated fast frequency slew of more than 50MHz per microsecond. The voltage control is done through a serial voltage I2C command interface, and is fully automated based on the policies that are defined. The memory management includes power-down modes for the DIMMs and also reducing the data access rate as needed. As the Power7 chip is a symmetric multiprocessing (SMP) system, and has SMP-based memory interfaces, the command-rate interface control was built with asynchronous control to be as adaptable as possible while addressing the needs of any core chiplet.
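The frequency range and slew rate quoted above lend themselves to a short worked example. The sketch below clamps a requested core frequency to the -50% to +10% window and bounds the time a change would take at 50MHz per microsecond. The 3000MHz nominal frequency is an assumed value for illustration only, since the article does not give one, and the function names are hypothetical.

```python
MIN_SCALE = 0.50          # -50% of nominal, per the article
MAX_SCALE = 1.10          # +10% of nominal, per the article
SLEW_MHZ_PER_US = 50.0    # article says "more than 50MHz per microsecond", so this is a floor


def clamp_frequency(requested_mhz: float, nominal_mhz: float) -> float:
    """Limit a requested core frequency to the allowed window around nominal."""
    low = nominal_mhz * MIN_SCALE
    high = nominal_mhz * MAX_SCALE
    return max(low, min(high, requested_mhz))


def worst_case_slew_us(from_mhz: float, to_mhz: float) -> float:
    """Upper bound on the time to move between two frequencies at the minimum slew rate."""
    return abs(to_mhz - from_mhz) / SLEW_MHZ_PER_US


if __name__ == "__main__":
    nominal = 3000.0  # hypothetical nominal frequency in MHz
    lowest = clamp_frequency(0.0, nominal)
    highest = clamp_frequency(10_000.0, nominal)
    print(f"allowed range: {lowest:.0f} to {highest:.0f} MHz")
    print(f"full-range slew takes at most {worst_case_slew_us(lowest, highest):.0f} us")
```

Under that assumed nominal frequency, sweeping the whole range would take tens of microseconds at most, which is consistent with the stated goal of minimizing latency.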

The Actuate function uses three different power-down modes besides the normal operating mode. These modes are per-core, and are based on both levels of power reduction and latency to return to full function. The modes are “Nap,” which targets about 5 microseconds of latency to return to operation, and is structured on turning off the clocks to the execution units; “Sleep,” which features 1 millisecond of turn-on latency and which has the clocks shut off while also purging the local caches; and “Heavy Sleep,” which has a 2 millisecond target recovery time. In this mode, all the cores are in “Sleep” mode, the voltage is reduced to all the cores and caches, and the states are loaded into low-voltage retention registers. The exit from heavy sleep includes an automated voltage ramp back to full operating voltage as the hardware is automatically initialized. These energy policies are in addition to the per-core frequency scaling, and the associated core voltage scaling that goes with the frequency adjustment.
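The trade-off between depth of power-down and wake-up latency can be sketched as a simple selection rule: pick the deepest mode whose recovery time still fits the latency the workload can tolerate. The latencies come from the description above, but the selection policy itself is a hypothetical illustration, not IBM's EnergyScale firmware. It also ignores the fact that Heavy Sleep applies to all cores at once.

```python
# (mode name, wake-up latency in microseconds), ordered from shallowest to deepest.
# Latency figures are the targets quoted in the article.
IDLE_MODES = [
    ("Nap", 5.0),
    ("Sleep", 1000.0),
    ("Heavy Sleep", 2000.0),
]


def deepest_mode_within(latency_budget_us: float):
    """Return the deepest mode whose wake-up latency fits the budget, or None.

    None means the core should stay in its normal operating mode.
    """
    chosen = None
    for name, wake_us in IDLE_MODES:
        if wake_us <= latency_budget_us:
            chosen = name          # later entries are deeper, so keep overwriting
    return chosen


if __name__ == "__main__":
    for budget in (1.0, 50.0, 1500.0, 5000.0):
        print(f"{budget:>7.0f} us budget -> {deepest_mode_within(budget)}")
```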

In addition to the direct sense, the firmware of the off-chip microcontroller can estimate functions based on the data coming in to adjust energy for leakage, temperature, and power supply variation. The last portion of intelligence for the energy-control system is the CPM. The circuitry dynamically detects margin in circuit timing and eliminates the potentials for static conservative margin guard-banding in the active designs.

The net result is more than a 50% improvement in the power for the individual cores as a system package using the automated on-chip controls and the off-chip microcontroller firmware based signal loop (as shown in the following figure).

Dual Core Embedded Processors Bring Benefits And Challenges

Thursday, April 8th, 2010

By John Blyler
The embedded processor market has now fully embraced the multicore world with the recent introduction of the dual core option for Intel’s Atom devices. Dual-core embedded processors offer designers many new benefits while presenting new challenges. How will the multicore option affect low power designs, virtualization, and single-threaded legacy software? Will these devices lead to more connectivity? Is the embedded processor market looking like the ASSP market of the future?

To answer these questions, Low-Power Engineering talked with Jonathan Luse, Director of Marketing for the Low-Power Embedded Products Division of Intel.

LPE: How does dual-core affect power consumption?
Luse: It’s best to think of the Atom as roughly split into two vectors–performance and power. The performance vector is a little less power constrained and a little more performance oriented, but still low power compared to Intel’s Core family of processors. The other major vector is low power. At the winter Embedded World Conference in Nürnberg, Germany, we introduced our entry performance level processors, which ranged from the dual-core option at about 13 watts thermal design power (TDP) to 5.5 watts for the single-core kit at 1.6GHz. This was designed to have a little more tolerance for power, with the expectation that Input/Output (IO) interface and performance would be increased over time.

LPE: Is there a target wattage for future embedded Atom processors? Low power competition is stiff, especially in the mobile markets.
Luse: The low power vector is a strategic imperative for Intel. But the low-power roadmap is a journey, not a destination. The minute that I have 5W products, then the 4W market calls me up saying, “You’re so close to our needs that if you just string another watt out, then we’ll start consuming your products.” But the minute that I have 4W processors, then the 3W market will call me and on it goes. Ideally, you could go to the spaces below Atom, i.e., into the application-specific standard product (ASSP) chips and microcontroller spaces where power is measured in milliwatts. Strategically, I look at that as a direction to move, provided we get the performance and the technology challenges to match the low power goals.

LPE: Does scalability remain intact with the new dual-core Atom?
Luse: Yes, it’s completely instruction-set compatible up and down the processor chain, from the embedded Xeon to the Atom. Obviously, there are some advanced functions in the higher end processors that won’t be executed in the low end ones.

LPE: How about virtualization?
Luse: The standard utilization of virtualization remains applicable. Nowadays, the trend has been to allocate functions to a core as opposed to splitting virtual machines across the same core, such as trying to emulate a quad core if you have a dual core. Today, many discussions focus around the blending of real time operating systems (RTOSes), as well as traditional operating systems, using some virtualization techniques. The goal is to mesh applications that have been on physically discrete systems into a virtualized environment.

This goal comes from vendor sensitivity about their RTOS performance being adversely affected by a general purpose OS. Historically, mission-critical applications like safety systems have a real time, deterministic operating system that is physically separate from a supervisor type of controller. However, today’s customers are both form factor and cost constrained in their applications. This has encouraged designers to be creative in the way they use virtualization, such as with the blending of RTOS and OS applications. This is a virtualization phenomenon, not a processor one.

LPE: Let’s turn to the software side of design. How are legacy single-threaded applications being addressed?
Luse: The readiness of software in multicore systems is an ongoing challenge. Within embedded systems, you have a long history of vast lines of code that are all single threaded to run on a single core processor. Most programmers and their companies don’t want to recode everything just to make it multithreaded so it will run better on multicore systems. But these programmers do want to take advantage of extra processors. That is where virtualization techniques can increase the processor compute density, i.e., to take advantage of multiple cores in applications that use existing single threaded software.

LPE: What are some of the more interesting applications that you’ve seen?
Luse: There is no way to predict all of the innovative ways in which customers create applications. For example, one customer is developing a smart energy harvester that supplies power to a wheel bearing monitoring system in a rail car. This system monitors and manages the wheel bearing motion to make sure the bearings are solid. It’s powered by an indirect mechanism based on the motion of the rail car itself. Like the mechanism in a Rolex watch, the energy harvester uses a cantered pendulum that swings back and forth, thus powering the system. The battery mechanism is powered by the motion of the rail car!

LPE: Do you see any emerging trends in the embedded space?
Luse: Perhaps the biggest trend is toward the connectivity of embedded devices. The cost of embedded connectivity and intelligence continues to go down. The next move for devices will be a growing awareness of their surroundings. Consider Amazon’s Kindle. Today, it’s completely unaware if another Kindle is nearby. The next generation Kindle or similar devices may be more aware and will be creative in the ways they utilize that awareness.

LPE: Many connected devices require new sensors. Is Intel considering the addition of embedded MEMS and sensors in its devices?
Luse: The classic challenge is what to integrate and what to keep discrete. What type of sensors might be included? If you include those sensors in the die, then you affect the cost models. But that is the business challenge of the future. If you want those sensors to be close to the CPU, then you must add more specialization into the chip itself. It’s getting to the point where, in addition to general purpose CPUs, there will also be a market for more application specific features and derivatives that almost look like application-specific standard products (ASSPs). If I look at the ASSP market, it starts to look like the CPU market 10 years from now, i.e., the amount of processing horsepower that will be put into an ASSP is increasing.

The View From Intel

Thursday, December 10th, 2009

Max Domeika talks to Low-Power Engineering about the impact of power and how that is affecting everything from embedded to multicore software.

http://www.youtube.com/watch?v=BVoren-2N40

Power Optimization Drives Embedded And Multicore Software

Thursday, December 10th, 2009

By John Blyler

Max Domeika, senior software engineer in the Developer Products Division at Intel, sat down with LPE Consulting Editor John Blyler to talk about the growing importance – and intersection – of both the multicore and embedded markets. What follows are excerpts of that conversation.

LPE: Intel’s software focus seems to be following its hardware processor drive into both multicore and embedded markets. What challenges does that bring for traditional software developers?

Max Domeika: Coming from a background in the desktop software application space at Intel I’m now spending more time working in the embedded multicore arena. This year I’ve been particularly focused on power issues, primarily on the Atom processor. My task is to see what sorts of tools are needed to help developers move to both embedded and multicore applications. Already I see a long term need for both power optimization and power measurement tools. The key is to monitor the power-related impacts of your application on the specific and overall system performance.

In the past, desktop clients and server users haven’t had to pay much attention to power. Desktop systems plug into the wall and their software applications use as much power as the processor wants to give them. Over the past several years these processors have incorporated features that help control the amount of power usage in both C (idle) states and P (operational) states.

LPE: How do these processor states affect the development of software applications?

In the past, processors either ran at full speed or idle. Several years ago hardware designers added features to the chips to control how deep of a sleep the processors are in. As you know, these features allow different portions of the chip and caches to be turned off. One of the challenges is that while deep sleep saves more power, it often takes more power to wake up.

For the software developer, this means that you don’t want the application to enter the deepest C-state if you will have to wake up immediately. There needs to be some smarts as to how deep a sleep you go into, which is really an operating system issue. P states, or operational states, utilize varying frequency and voltage to balance the amount of the execution that the OS determines is needed. These states directly affect the performance of the system.
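The reasoning about sleep depth can be expressed as a break-even calculation: a deep C-state only pays off if the processor stays idle long enough for the power saved to outweigh the extra energy spent entering and leaving that state. The sketch below shows the arithmetic. All power and energy figures in it are invented for illustration, and real operating systems use more elaborate idle governors.

```python
def break_even_idle_us(shallow_idle_mw: float, deep_idle_mw: float,
                       transition_energy_uj: float) -> float:
    """Idle duration beyond which the deep state uses less total energy.

    transition_energy_uj is the extra energy spent entering and waking from
    the deep state compared with the shallow one.
    """
    saved_mw = shallow_idle_mw - deep_idle_mw
    if saved_mw <= 0:
        raise ValueError("the deep state must draw less power than the shallow state")
    # 1 mW for 1 us is 1 nJ, so convert the transition energy from uJ to nJ.
    return transition_energy_uj * 1000.0 / saved_mw


def should_enter_deep_state(expected_idle_us: float, shallow_idle_mw: float,
                            deep_idle_mw: float, transition_energy_uj: float) -> bool:
    """True when the expected idle period is longer than the break-even time."""
    threshold = break_even_idle_us(shallow_idle_mw, deep_idle_mw, transition_energy_uj)
    return expected_idle_us > threshold


if __name__ == "__main__":
    # Hypothetical figures: 500 mW shallow idle, 100 mW deep idle, 200 uJ to transition.
    print("break-even idle time:", break_even_idle_us(500.0, 100.0, 200.0), "us")
    print("sleep deeply for a 100 us gap?", should_enter_deep_state(100.0, 500.0, 100.0, 200.0))
    print("sleep deeply for a 5 ms gap?  ", should_enter_deep_state(5000.0, 500.0, 100.0, 200.0))
```

An application that wakes the processor with frequent short timers keeps the expected idle period below the break-even point, which is why application behavior affects how effectively the C-states are used.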

These states can have a big impact on the application, restricting how developers write the code. If your application causes the processor to perform poorly, that will have a negative effect on power utilization. Developers need more mature tools to help figure out which application processes result in effective use of the C-states.

I see a need for continued maturity of the power optimization and power measurement tools and methodologies. These power tools must also be tied to traditional performance analysis and optimization tools, because many of the techniques for mitigating power are the same techniques that you use for traditional performance optimization. The entire system must run as quickly as possible while using as little power as possible.

At the other end of the spectrum are the same issues of power optimization and performance analysis, but applied to a multicore environment. Tools and methodologies need to mature to include multicore development, too.

LPE: Are the two worlds of embedded and multicore coming together? After all, Intel’s Atom isn’t yet a multicore architecture, is it?

Well, some instances of the processor already support hyper-threading, a technology that dates back to the Pentium 4 processor. The key here is hyper-threading, which makes the environment look like two (or more) processors from the point of view of the operating system. That’s why the software techniques that developers use on multicore are starting to have an impact on embedded applications targeting the Atom processor.

LPE: Isn’t the low power push also affecting the high-end embedded and multicore processors like the Xeon?

Power is important across the board. We’ve seen power optimization become important in servers – especially with the “greening” of data centers.

LPE: Let’s talk about the actual tool environment for both power issues and multicore design. What’s happening there?

Today, I’d summarize the tool environment as consisting of a collection of separate tools and techniques. For example, if you want to do power optimization then you might use PowerTOP – an open source tool for doing power measurement. Conversely, if you want to do performance analysis – to count cache misses, branch mispredictions, memory fetches and the like – using performance monitoring counters, you might use another open source tool called OProfile. Intel also has a tool called the VTune Performance Analyzer.

These tools show what performance issues are occurring on the chip, which in turn helps the developers to optimize their code. For example, if you see examples of high cache miss rates, you can investigate to see what portions of the code are causing this problem. This might mean that the data structure of the application needs to be changed to get better cache performance. Performance and power tools give the developer a means of getting valuable feedback from the hardware.

LPE: Most desktop application developers are well versed in Microsoft’s Visual Studio IDE. What tools are available for these developers as they move toward multicore applications?

Intel has the Intel Parallel Studio, which integrates well with Microsoft’s Visual Studio for multicore (parallel) code development. Parallel Studio is not targeted at embedded folks, but rather at the desktop client environment. Intel has tools that also help with the programming interface, compiler, libraries and more. With regard to debugging, we have a set of enhancements that integrate into the MS Visual Studio to help with parallel debug.

Debugging is a key issue. While developers could someday spawn 64 threads on a multicore chip – because they have 64 cores – that is not the best way to begin. In multithread implementations it’s best to start with one thread and make sure the program works, then move up to two and four, etc. Good debugging tools provide easy mechanisms to start debugging one thread then scale to more cores, i.e., they are serially consistent.

Another challenge in multicore development is in the area of configuration control. You may have multiple threads running on multiple cores, but you don’t want multiple versions of the same code. Instead, you want one version of the code with a parameter that you can adjust as the number of processor cores changes. Again, good debugging tools have those configuration control features.
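The two practices described here, starting with one thread and keeping a single code version with an adjustable parameter, can be sketched together. In the example below the worker count is the only thing that changes between a one-thread debug run and a multi-thread run, and the results are compared for consistency. The `checksum` workload and function names are hypothetical, and because of Python's global interpreter lock this CPU-bound example illustrates code structure rather than a speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def checksum(block: bytes) -> int:
    """Stand-in workload for this sketch: a simple modular checksum."""
    return sum(block) % 65521


def process_blocks(blocks, workers: Optional[int] = None) -> list:
    """Process every block using a configurable number of threads.

    The same code path runs whether workers is 1 (for debugging) or the
    number of available cores; only the parameter changes.
    """
    if workers is None:
        workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(checksum, blocks))   # map() preserves input order


blocks = [bytes([i % 256]) * 1024 for i in range(32)]

# Step 1: make sure the program works with a single thread.
baseline = process_blocks(blocks, workers=1)

if __name__ == "__main__":
    # Step 2: scale up and confirm the results match the single-thread run.
    for count in (2, 4):
        assert process_blocks(blocks, workers=count) == baseline
    print("results are consistent across worker counts")
```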

LPE: Tools work best when they are following a set methodology. You’re co-chair with David Stewart from CriticalBlue on the Multicore Programming Practices (MPP) Working Group – part of the Multicore Association for which Markus Levy is CEO. Please give the readers a quick update of the MPP.

As you know, our focus is on documenting best-known methods for multicore software development techniques. This year we are in the middle of documenting the best practices. Internal review of this document should begin shortly. We hope to have it ready for external review by the first half of next year.

This document will fulfill a need that is oftentimes overlooked, namely, what are the best practices using the technology that is available today. We have customers that are becoming more and more aware of the challenges of multicore software development. But we are still building awareness and educating the larger group of mainstream programmers. Even when mainstream developers identify the need for a multicore program, they are often stuck with their existing code. Not everyone has the resources, time or need to completely rewrite their legacy code. The MPP document will provide mainstream programmers with a workable set of best practices for multicore development throughout the typical life cycle development process: analysis, design-implementation, debug and the performance tune-up phase.

That’s the big vision. We just have to keep executing. This is all volunteer work, so it’s not something where I can say, ‘Hey, let’s meet this schedule in two weeks. You have to do it.’ Instead, we just have to keep the momentum going. I am pretty pleased with the progress. The feedback from our internal surveys is positive.

LPE: The goal of your best practices working group is to use existing languages like C/C++ to develop multicore applications. Do you see the need to create new programming languages?

The Multicore Association has other working groups that are developing standards for new software approaches, such as those focused on multicore communications and runtime APIs. The overall plan is to incorporate best-known techniques using those APIs as we move forward. But it’s hard to predetermine the best-known methods before the APIs are available. We won’t know until we get there, but we can’t wait for one to proceed before starting the other.

Intel is working on several technologies, both language extensions and new APIs, so yes, there is a need for technology; I’m not so sure about new programming languages. In embedded, C and C++ are going to be with us for some time, so I’d say there’s probably less need in embedded.

Next Page »