Published on June 25th, 2008

Implementing Multi-Core: The Devil Is in the Detail

Processor designers have known for some time that the days of the traditional uni-processor architecture are numbered. In retrospect, it seems that the 50%-per-year performance increases achieved throughout the 1980s were cheaply won. Straightforward process shrinks yielded more and faster transistors that could do the same job with lower power dissipation.

But since 1999, the picture has looked rather different. Moore’s Law has been supplemented with Pollack’s Rule (named after Fred Pollack, another distinguished Intel veteran), which suggests that performance has increased in proportion to the square root of complexity (doubling performance requires four times more transistors).
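The arithmetic behind Pollack's Rule can be sketched in a few lines. This is an illustrative calculation only, with arbitrary baseline units, not a model of any particular processor:

```python
# Pollack's Rule: uni-processor performance scales roughly with the
# square root of transistor count (complexity). Doubling performance
# therefore costs about four times the transistors.

def pollack_performance(transistors, base_transistors=1.0, base_perf=1.0):
    """Relative single-core performance under Pollack's Rule."""
    return base_perf * (transistors / base_transistors) ** 0.5

print(pollack_performance(4.0))   # 4x the transistors -> 2.0x performance
print(pollack_performance(16.0))  # 16x the transistors -> 4.0x performance
```

The diminishing returns are stark: each doubling of performance demands a quadrupling of the transistor budget, which is exactly why simply spending the shrink dividend on a bigger core stopped paying off.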

So, while shrinks continue to arrive on schedule, their beneficial effects on uni-processor architectures are diminishing. Despite progression to 65nm and 45nm geometries, clock speeds have topped out at around 3GHz. At the same time, device architects have found it increasingly difficult to put the extra transistors on offer to good use. The instruction-level parallelism of traditional superscalar architectures has run out of steam.

This matters because power management has become a major headache. Modern chips contain so many transistors that it is a tough job to deliver enough power to keep them all doing useful work. The designer then faces the reverse problem: having squeezed more transistors into a smaller area, moving the resulting heat (10 W/cm² and higher) off the device is increasingly difficult.

As a result, chip architects have turned away from uni-processor architectures. This trend is already in evidence on the desktop, where dual- and quad-core processors are becoming the norm.

Such innovations, however, barely scratch the surface of the potential of multi-core systems. The multi-core approach can indeed restore the link between performance and complexity and deliver both better MIPS per Watt and MIPS per dollar. But to do so requires very careful choice of array size and core complexity—and tool support that makes the programmer’s task as easy as coding for a single processor is absolutely critical.
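The restored link between performance and complexity can be sketched numerically. The comparison below is a best-case, hypothetical one: it assumes a perfectly parallelisable workload and Pollack's square-root scaling for the single core, with transistor budgets in arbitrary units:

```python
# Spend a fixed transistor budget two ways: one large core, or an
# array of small cores. Best-case sketch: assumes a perfectly
# parallel workload and Pollack's sqrt(complexity) scaling.

SMALL_CORE = 1.0  # transistor budget of one small core (arbitrary unit)

def single_core_perf(budget):
    # One core consuming the whole budget: performance ~ sqrt(complexity).
    return budget ** 0.5

def array_perf(budget):
    # Many small cores: each delivers sqrt(SMALL_CORE), and throughput
    # (ideally) adds linearly across the array.
    n_cores = int(budget // SMALL_CORE)
    return n_cores * SMALL_CORE ** 0.5

for budget in (16, 64, 256):
    print(budget, single_core_perf(budget), array_perf(budget))
```

Under these assumptions the array's throughput grows linearly with the transistor budget while the single core grows only with its square root, which is the sense in which multi-core makes process shrinks valuable again. Real workloads are never perfectly parallel, so the gap in practice is smaller.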

In fact, it turns out that two-, four-, or eight-core arrays really miss the point. Making a process “a little bit parallel” can provide performance benefits, but only with significant penalties in terms of complexity and usability. Sixteen- or 64-core arrays are a good start, but it is only when the system scales up to hundreds of cores that the true benefits kick in.

An array of this size improves manufacturing yield via a redundant approach in which any faulty blocks can be permanently disabled. A similar process, applied dynamically at run time, provides implicit power management: Inactive blocks can be temporarily shut down. A well-designed multi-core architecture also delivers significantly better performance with growing array size, so process shrinks become valuable once again. And data and instruction storage can be localized, removing another of the major bottlenecks in modern applications.

Perhaps most importantly, a large array allows the choice of an optimally sized processing unit that will vary with the target application. But a “Darwinian logic” suggests a complexity closer to that of the Intel 8086 than that of a Pentium 4. Analysis at my own company, focused on wireless signal processing applications, suggests that a 16-bit, three-way long-instruction-word processing unit with a three-deep pipeline running at 100MHz to 200MHz delivers close to the optimum in terms of MIPS/mm² of silicon or MIPS/mW. Such a structure allows array sizes of several hundred cores.
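A back-of-envelope calculation shows what such an array could deliver at peak. The array size of 300 is an assumption standing in for “several hundred cores,” and these are peak issue-rate figures; real utilisation is always lower:

```python
# Peak aggregate throughput for the core described above: a three-way
# (three issue slots) unit at the upper end of the stated 100-200 MHz
# range, in an assumed array of 300 cores. Peak figures only.

ISSUE_SLOTS = 3
CLOCK_MHZ = 200    # upper end of the stated clock range
N_CORES = 300      # "several hundred" -- an assumed array size

peak_mips_per_core = ISSUE_SLOTS * CLOCK_MHZ      # 600 MIPS per core
peak_array_mips = peak_mips_per_core * N_CORES    # 180,000 MIPS aggregate
print(peak_mips_per_core, peak_array_mips)
```

Even heavily derated, numbers of this order explain why many simple cores beat one complex core on both MIPS/mm² and MIPS/mW for parallel signal-processing workloads.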

With all these elements correctly designed and in place, system architects can turn their attention to perhaps the most important element of all: A good multi-core architecture must be as easy to program as a single-core device.

In principle, such usability is another bonus of the multi-core approach, which provides a very natural model of abstraction for the many millions of transistors. Analogous to object-oriented programming, each processor in the array can be treated as an independent, encapsulated block. Design, verification, and validation all become easier when dealing with such independent sub-units and hierarchically scaling up to system level.

In practice, however, usability is more about familiarity than utility. The multi-core system must let the designer use “standard” languages—in practice, HDLs for hardware and ANSI C for software. There must also be access to robust tools and a familiar development environment.

Multi-core architectures are undoubtedly the way forward. But the devil is in the detail: large arrays, not small; small processor units, not large; and, most important of all, the provision of programming tools “usable by engineers other than the processor designer” to make real systems from the concepts.

Peter Claydon is co-founder and COO of picoChip, a venture-backed, fabless semiconductor company based in Bath, England.







