Taming The Multicore Beast
By Ed Sperling
Multicore chips are here to stay. Now what?
That question is echoing up and down the ranks of tools vendors, design engineers, software developers and even among people who measure the performance and efficiency of semiconductors. There is now a Multicore Expo and a Multicore Association that includes a who’s who of electronics. And there are lots of working groups developing different strategies to tackle this Hydra-like creature that has befuddled the best software minds in the world for four decades.
Why multicore?
Multicore was firmly on the horizon for chipmakers when they hit the 130nm process node. By the next process node, they realized, it would be impossible to turn up clock speeds without cooking the chip. For all intents and purposes, classical scaling—gaining performance at each new process node—ended at 90nm. The solution was to add more processing cores at lower speeds, and hand off the burden to software developers to fix the problem. After all, it’s hard to argue with the laws of physics.
This explains why most major university computer science departments now are dedicating a significant portion of their research to solving the conundrum of how to program multiple cores. The problem is interesting and the payoff can be huge to anyone who solves it.
It also helps explain why Intel invested $218.5 million in VMware in 2007. Virtualization is a safety net for utilizing more cores on a chip: if software can’t be developed to run on multiple cores, at least multiple instances of an operating system, or multiple operating systems, can run on the chip using virtual machines. Intel is adding “turbo mode” to its upcoming chips, though, which allows more of a chip’s total horsepower to be utilized in bursts on a single core if the application demands it.
Designing multicore chips
What becomes painfully obvious as you descend from 60,000 feet on the multicore world is that one core is not necessarily the same as the next, although it can be. There are homogeneous cores in semiconductors made by companies such as Intel and Freescale, there are heterogeneous cores in systems on chip, and sometimes there are both in SoCs.
While it’s easier to design a chip with homogeneous cores—you simply develop it once and then figure out the best way to share memory and bus traffic patterns—that approach isn’t nearly as efficient for a multifunction device such as a smart phone. The reason is that every application requires a different amount of processing power, and assigning the maximum to each one isn’t an ideal strategy.
In the embedded world, ARM has taken a first stab at this problem with its ARM11 MPCore multicore processor, which can be configured for one to four cores.
And to simplify building the chips, all of the major EDA tools vendors either have added multicore elements to their flows or are working on them. Mentor Graphics has been working in multicore debugging with its Seamless products, Synopsys has added multicore for verification, implementation and manufacturing, and Cadence has added multicore support for Virtuoso. Expect to see more announcements from these and other vendors over the next few months, as well as virtual prototyping solutions and faster simulation.
Where’s the application software?
So now that the tools to make the chips and follow them through the verification and manufacturing are being prepared, what’s next?
The next piece is application software, and most of the code that has been written in the past has been written using a serial approach. There is no easy way to compile that onto multiple cores, although there are tools to help.
CriticalBlue just introduced its Prism tool to help parallelize legacy code. While you still can’t push a button to make it all work, and you can’t rework applications for two cores and have them fully take advantage of 32 cores, this kind of tool is a step in the right direction.
Another important piece of the puzzle is mapping the software to the interconnect. PolyCore has developed a middleware layer and tools to do that, distributing functions to different cores—something that is vital in multicore topologies, where shared busses and memory create problems that never existed in single-core chips.
Finally, Virtutech has developed a simulated environment for multicore applications with its Simics tool, creating what-if scenarios for applications.
But all of these tools still don’t produce the kind of volume of new applications that can be scaled across many cores. Sven Brehmer, president of PolyCore, said the gap between hardware and software is larger than it has ever been—and it will take years to close that gap again.
“There is a broader group of developers using multicore but they don’t know how to develop software yet or they don’t want to spend money on this problem,” Brehmer said. “There is no magic bullet here, but the open source community sees a need to simplify multicore. We’ve solved a portion of the problem but there’s a lot of work to be done and it has to be done at a pace that works for software developers. You can’t go from two to six cores overnight.”
But at least there is an incentive. “With all the potential monetary rewards, something will come out of this,” said Markus Levy, president of the Embedded Microprocessor Benchmark Consortium (EEMBC).
Hype vs. reality
When multicore programming gains critical mass is another matter. For all the talk about multicore initiatives, the reality is that there has never been a consistent industry effort to make multicore approaches work. And the problems of parallelizing software in the past have been confined to a small circle of computer science researchers at universities and at companies like IBM and AT&T. There has never been a massive effort to solve the problem because for the most part it didn’t have to be solved.
Making matters even more confusing, it’s hard to get a straight answer about what’s real in multicore and what isn’t. Just because software can run on a multicore machine doesn’t mean it runs faster on four cores than on one. In fact, some software may not take advantage of more than one core even though it will work on a four-core processor. “There are a lot of companies taking existing stuff and putting a new label on it and saying it’s multicore compatible,” said Levy.
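One common way to reason about why more cores do not automatically mean more speed is Amdahl's law, which the article does not name but which fits the point. The sketch below is illustrative only: the function name and the sample parallel fractions are assumptions, and the model ignores overheads such as memory and bus contention, so real results would typically be worse.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's law.

    parallel_fraction is the share of run time that can be spread
    across cores; the rest is assumed to stay strictly serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: fully serial, half parallel, 95% parallel.
    for fraction in (0.0, 0.5, 0.95):
        row = ", ".join(
            f"{n} cores: {amdahl_speedup(fraction, n):.2f}x" for n in (1, 4, 32)
        )
        print(f"parallel fraction {fraction:.2f} -> {row}")
```

Under this model a fully serial program gains nothing from four cores, and a program that is half parallel tops out below 2x no matter how many cores are added.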
What has worked exceptionally well in the multicore world are applications that can be parsed into specific pieces. Graphics and video rendering work particularly well, for example. Imagination Technologies, a U.K.-based IP vendor, builds scalable multicore graphics engines that parallelize the computing below the application level, Tony King-Smith, vice president of marketing for the company’s technology division, said during a keynote at the recent Multicore Expo.
“We can parallelize from 1 to 4 pipes and beyond, and we can multicore 1 pipe to 64 cores,” he said. “But to do this, you have to get the architecture right. If you get it wrong, you’ll spend too much effort on overhead.”
Freescale has taken a similar approach with its multimedia DSP technology. Kent Fisher, chief systems engineer for Freescale’s networking and multimedia group, said the big decision for his division is whether to use more smaller cores or a few larger cores. “It depends on your application,” he said. “And until the software tools catch up to the hardware, frequency and instructions per clock will continue to matter.”
He noted a problem with multicore power specifications as well: not everyone specifies power the same way.
Splitting the atom
From a software application perspective, there are several challenges that need to be considered. First, there needs to be a proper balance between splitting up different functions and splitting those functions into too many parts.
“What you really need to do is find the relative load of each function of an application,” said PolyCore’s Brehmer. “If you have a computation, you may be able to duplicate that on multiple cores. But you also have to look at data dependencies, because you can’t break a function out if it depends on data from some other place. Otherwise you’ll just be waiting for that data.”
That’s at least a major step toward understanding the resources that will be needed on a chip, which works well with homogeneous multicore systems. The next step will be better utilization of heterogeneous cores, which will require an understanding of application functionality all the way at the chip architecture level. It doesn’t make sense to have the same level of power for all parts of an application if those pieces are not identical in importance or the amount of processing that’s required.
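Brehmer's point about relative load and data dependencies can be sketched with a toy scheduler. Everything in the example is hypothetical: the task names, their costs and the greedy list-scheduling policy are assumptions chosen for illustration, not a description of PolyCore's tools. It assumes identical (homogeneous) cores and no communication cost.

```python
import heapq

# Hypothetical application: each function has a relative load (cost)
# and a list of functions whose data it must wait for.
COSTS = {"decode": 4, "filter_a": 3, "filter_b": 3, "mix": 2}
DEPS = {"filter_a": ["decode"], "filter_b": ["decode"], "mix": ["filter_a", "filter_b"]}


def schedule(costs, deps, cores):
    """Greedy list scheduling on identical cores; returns the finish time."""
    waiting_on = {task: len(deps.get(task, ())) for task in costs}
    dependents = {task: [] for task in costs}
    for task, needs in deps.items():
        for need in needs:
            dependents[need].append(task)

    # Tasks with no unmet dependencies, heaviest first.
    ready = sorted((t for t, n in waiting_on.items() if n == 0), key=lambda t: -costs[t])
    running = []  # min-heap of (finish_time, task)
    now, free, done = 0.0, cores, 0

    while done < len(costs):
        # Start as many ready tasks as there are idle cores.
        while ready and free > 0:
            task = ready.pop(0)
            heapq.heappush(running, (now + costs[task], task))
            free -= 1
        if not running:
            raise ValueError("dependency cycle: no task can run")
        # Advance time to the next task completion.
        now, finished = heapq.heappop(running)
        free += 1
        done += 1
        for child in dependents[finished]:
            waiting_on[child] -= 1
            if waiting_on[child] == 0:
                ready.append(child)
        ready.sort(key=lambda t: -costs[t])
    return now


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} core(s): finish time {schedule(COSTS, DEPS, n)}")
```

With these made-up numbers the finish time drops from 12 on one core to 9 on two, and stays at 9 on four: the chain decode, filter, mix is a data dependency that extra cores cannot shorten.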
And finally, some software development may be done with much thinner layers of an operating system—or even direct execution into the metal—as multicore SoCs become more integrated into system-level design.
The promise is better performance and ultimately lower power consumption, but it’s going to take time, committed effort of engineers and scientists, and collaboration from groups that in the past have never spoken the same language. Multicore is also multidisciplinary, and that’s a whole different problem to solve.
Tags: ARM, Cadence, criticalblue, Freescale, Imagination Technologies, Intel, Mentor Graphics, multicore, Polycore, Synopsys, System-Level Design, Virtutech, VMware

March 28th, 2009 at 1:54 am
‘Multicore chips are here to stay.’
I doubt it. They are just one in a long line of flawed attempts to solve real problems, and this is obvious when you look at just three issues.
First, multicore implies shared resources, and shared resources strangle performance; as you add processors, more time gets lost waiting for shared things and the extra processors give less and less benefit.
Second, multicore is an example of the false mantra, unique in computing as it’s absent in all other engineering disciplines, ‘more visibly complex means better’. In order to solve the never-ending sequence of problems created by multicore, designs become increasingly complicated as ever more hardware is added to try to remove the inevitable bottlenecks. No wonder there is, and will always be, a huge problem trying to program the beasts.
Third, for obvious short-term marketing reasons, the lure is to make a silk purse out of a sow’s ear: to make random legacy code run fast with minimal changes. This is reasonable as a stop-gap measure, but does anyone honestly believe that in even ten years we shall still be writing huge monolithic lumps of code and expect them to be magically broken apart to run on multicore monstrosities?
The Emperor really does have no clothes, and engineers who can see this have to stand up and say so. Multicore will follow dead-end ideas like bubble memory and capability systems into oblivion, to be replaced by simpler, real solutions: non-shared multiprocessor systems and coherent multiprocessor programming models.
March 29th, 2009 at 3:24 am
In the early ’90s we were studying the use of transputers and the multicore language OCCAM.
I would like to get an update on the status of the use of OCCAM in multicore environments.
And what happened to the transputers?
March 30th, 2009 at 7:37 am
The transputer was most emphatically NOT a multicore device and OCCAM was not a multicore language.
The transputer was aimed at distributed multiprocessor systems that avoid the bottleneck of shared-memory. It was the right way to go but failed, in my opinion, because it fell into the complexity trap, was outpaced by speed improvements in standard processors, and wasn’t invented in the USA. Also, it was hamstrung by the belief that you had to use OCCAM, a language that was foreign to the majority of potential users (regardless of its benefits or drawbacks).
March 30th, 2009 at 11:11 am
Peter, I go back and forth in agreeing and disagreeing with your view on multicore. I think multicore has some low-hanging fruit. Graphics rendering and databases are two examples. For most other applications, though, there are some serious problems that have to be solved, and which haven’t been solved over the last four decades. Are there more resources being thrown at the problem now? Yes. Will that solve the problem? Answer unknown. I think the bigger issue, which will be the subject of a future story, will be that most software is no longer scalable. You can’t write it and expect it to run on the next generation chip, using the OS to sort out everything in between. It won’t scale that way. Moore’s Law took care of that at 90nm, which is the same time that classical scaling ended on chips. Interesting coincidence.
March 30th, 2009 at 11:20 am
“It all depends on what your definition of ‘multicore’ is”. The term ‘multicore’ is very sloppily defined. To some people, it means just homogeneous SMP style multicores such as Intel, AMD, IBM and SUN are all selling, primarily for desktops and servers, although with ARM MPCore we may see more in consumer portable embedded devices (and MIPS and others also have SMP multicore devices). To those of us of a more MPSoC or multiprocessor bent, the term ‘multicore’ (viz. the Multicore Association and Multicore Expo) can also broadly include AMP kinds of MPSoC or multiprocessor systems. Pull out any cell phone of the last decade and find the DSP+Control Processor (usually an ARM) in it – heterogeneous Asymmetric MultiProcessing. Pull out a smart phone or cell phone of more recent vintage and possibly find an audio application processor, a video application processor, etc. – “MultiASIP”. Most of these devices are designed as heterogeneous multi-processor devices and the issues of dealing with legacy SW are not as onerous – add a processor that can run legacy control SW and power it down when not needed. And remember the portable and low end consumer products – it is NOT just all about large servers and desktops plugged in the wall. The imperative to save energy tends to lead portable appliances towards more heterogeneous multiprocessor solutions.
April 3rd, 2009 at 2:34 pm
OCCAM was a parallel processing language (so you could call it a multicore programming language), however it was extremely restrictive and a pain to use – I wrote 12 lines or so to boot my C code, and had the fun job of translating it into VHDL (while working at Inmos). The underlying concepts of CSP are very applicable to multicore.
I don’t think the early transputers (say T800) could be described as overly complex (the T9000 was).
Generally hardware development seems to run ahead of software methodology, so while there are plenty of multicore platforms, there isn’t a good common way to program them. So many (like the transputer) die before folks have worked out how to use them.
Whether it’s AMP or SMP the beast will not be tamed until the software methodology gets ahead of the hardware. Plain old C/C++ isn’t going to cut it.
April 3rd, 2009 at 2:55 pm
I tend to agree, multicore is NOT here to stay. You’re replicating the whole core and strapping on a shared L3 and various arbiter technology. That suggests it’s a nascent technology. Instead of replicating the whole core, more advanced technology would seem to want to have a multi-fetch, multi-issue, multi-decode, multi-ALU, and…I don’t know what you do with the L1. That would arguably bring us right back to superscalar architecture again. Except this time, instead of trying to multi-issue a single instruction stream (the bottlenecks of which motivated us to go multicore in the first place), we take on all the multitasking, multithreading, asynchronous processing, etc. by replicating all the individual units, rather than entire cores. Maybe ALU_3 wants to get its operand directly from a latch on FETCH_0? That’s faster than waiting for Core_0 to writeback its result off to shared memory somewhere. The finer level of granularity is harder, more advanced; but it lets us make optimizations at multiple levels previously not possible.
April 6th, 2009 at 5:49 am
“I don’t think the early transputers (say T800) could be described as overly complex (the T9000 was).”
Exactly my point. Instead of concentrating on tracking technology with the simple devices (T4/T8), Inmos decided on a redesign for the T9 and added layers of complexity that led to delays and inefficiencies.
“The underlying concepts of CSP are very applicable to multicore.”
That is true, but the fundamental problems with multicore (by which I mean multiple processing units on a common memory) will hit you regardless of the programming methodology.
April 6th, 2009 at 10:13 am
Shared memory is really the problem. The popularity of the SMP architecture comes out of having used different processes for memory and CPU chips, and now most of the performance problems come from having to shift data between the memory and the processor. For scalability and efficiency you really want to bury the processing in the middle of the memory so that you don’t waste energy moving your data around. But as I said above the software methodology isn’t there to support Processor-In-Memory (PIM) architectures – but I am working on it.
April 7th, 2009 at 7:37 pm
Coherent memory is not actually a requirement of multicore systems as much as it is of particular software paradigms. AMP systems don’t usually need it the same way SMP systems do. So I’d say multicore is definitely here to stay, but current (SMP) software methodology is on its way out.