For the past several years, a flood of multicore processors has entered the market touting the capability to break through the performance barriers of clock rates and power consumption. Driven by the inexorable march of Moore's Law, processor companies seem to believe that the only way to utilize the projected billion-transistor silicon budgets is to add four, eight, or even hundreds of sequential processing engines. They task the compiler teams or software architects with finding ways to harness that performance.
Driving this movement to multicore architectures is the fact that making bigger single-processor systems has provided diminishing returns, as expressed by Pollack's Rule. That rule states that performance increases in proportion to the square root of the increase in complexity, where complexity can be measured as the number of transistors or the die area.
Doubling the complexity of a processor therefore provides only a 1.4X gain in performance, and a 4X increase in complexity is needed to gain just a 2X performance boost, not to mention the increase in power consumption that comes with this complexity.
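As a rough check on those figures, the square-root relationship can be sketched in a few lines of Python. The function names are illustrative only, and the sketch treats Pollack's Rule as an exact formula, although it is really an empirical rule of thumb.

```python
import math


def pollack_gain(complexity_ratio: float) -> float:
    """Estimated performance gain for a given increase in complexity.

    Assumes Pollack's Rule holds exactly: gain = sqrt(complexity increase),
    where complexity is transistor count or die area.
    """
    if complexity_ratio <= 0:
        raise ValueError("complexity_ratio must be positive")
    return math.sqrt(complexity_ratio)


def complexity_needed(performance_gain: float) -> float:
    """Inverse of the rule: complexity increase needed for a target gain."""
    if performance_gain <= 0:
        raise ValueError("performance_gain must be positive")
    return performance_gain ** 2


if __name__ == "__main__":
    for ratio in (2, 4):
        print(f"{ratio}X complexity -> {pollack_gain(ratio):.2f}X performance")
    print(f"2X performance needs {complexity_needed(2):.0f}X complexity")
```

Under that assumption, doubling complexity returns about 1.41X, and a 2X gain costs 4X the complexity, matching the figures above.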
Clearly, just increasing the size of the processor wasn't the answer. While the march to finer process geometries provides some increase in overall processor frequency, that return also is diminishing. A new approach was needed and Intel, the leading supplier of general-purpose CPUs, recently unveiled R&D projects in multicore architectures. Not surprisingly, the aforementioned Pollack's Rule came from Intel.
Multi-processor systems have a limitation of their own, however, which was first described by Gene Amdahl in the late 1960s. Amdahl's Law states that the amount of performance one can expect from multi-processor systems is limited by the percentage of serial code in the application (with n being the number of processors):
S(n) <= 1/(Serial% + (1-Serial%)/n)
This equation describes a curve of diminishing returns: however many processors are added, the speedup can never exceed 1/Serial%. Say, for example, that only 20% of your code is serial. To get a 3X performance boost, you need six processors. Beyond six processors, each additional processor contributes less and less, and the speedup can never reach 5X. This is only marginally better than Pollack's Rule, under which a 6X increase in processor complexity would yield a 2.4X performance improvement. Over the intervening decades, a number of bright engineers and engineering teams have tried to overcome Amdahl's Law. Yet none of their schemes has gained widespread acceptance.
This lack of adoption stems from a lack of success, and that lack of success has largely been a software problem. Still, new multicore or multi-execution-unit architectures keep appearing on the scene, and they continue to be plagued by software woes.
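The 20% serial example can be checked numerically. The following is a minimal sketch that takes the bound in Amdahl's Law as an equality (the best case); the function names are illustrative, not taken from any library.

```python
def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """Best-case speedup on n processors, per Amdahl's Law.

    serial_fraction is the share of the code that cannot be
    parallelized, written as 0.2 for 20%.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def speedup_ceiling(serial_fraction: float) -> float:
    """Limit of the speedup as the processor count grows without bound."""
    if serial_fraction == 0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    serial = 0.2  # 20% serial code, as in the example above
    for n in (1, 2, 6, 16, 64, 1024):
        print(f"{n:>5} processors -> {amdahl_speedup(serial, n):.2f}X")
    print(f"ceiling: {speedup_ceiling(serial):.1f}X")
```

With 20% serial code, six processors give 3X, sixteen give 4X, and even 1,024 processors stay just under the 5X ceiling.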
If we look at the high-performance compute architectures in use today, be they supercomputers, wired/wireless infrastructure, or video-compression systems, we see a much different approach. Instead of trying to scale multiple processors or digital signal processors (DSPs), these "real-world" systems use a combination of a sequential processor (or two or three) and either field-programmable gate arrays (FPGAs) or custom silicon. The FPGAs or custom silicon accelerate the critical compute functions that normal sequential processors are unable to handle efficiently. These hybrid systems don't just provide the required processing power. They actually consume less power and cost less than if they had been implemented as multi-processor systems. The industry adopted this approach because it has learned, through hard experience, that it is unable to effectively harness the promised power of multi-processor systems.
It isn't that these hybrid systems are easy to design; they're not. In order to use this approach, companies must go through the painstaking task of converting their algorithms into hardware descriptions (RTL) and architecting their own processor/coprocessor systems. But the message is clear: Implementing algorithms in fabrics or custom silicon provides better compute performance than multi-processor systems.
At Stretch, we believe that the future of high-performance embedded computing lies in combining a RISC processor with a reprogrammable fabric embedded directly within the processor's datapath. The ability to custom-define accelerated instructions using C/C++ eliminates the need to describe algorithms in RTL. In addition, re-using the fabric sequentially as the processor executes program threads provides significant performance increases without having to turn to esoteric, massively parallelizing compilers or multi-threaded approaches.