The Perfect Recipe

May 10th, 2012

By Chris Rowen
I’ve been working on logic synthesis and layout for almost 30 years, and the technology never ceases to amaze me. The core problem is a hard one: How do you take a high-level logic description, written in human-comprehensible terms, and transform it into a near-optimal network of gates, realized in any desired semiconductor process?

Logic synthesis has evolved enormously over the years, applying ever more sophisticated transformations in the combination and sizing of gates in order to find common sub-expressions, map to target logic cells, reduce path delay and lower power dissipation. Layout also has changed radically, as complex standard-cell libraries, optimized for specific CMOS processes, have become the most common building blocks for complex logic functions. And all this operates under the unyielding requirements of strict functional compatibility with the original description, typically Verilog or some high-level language.

The logic transformations of logic synthesis act to chop up the logic into a uniform logic “puree” in which the original gates of the design are no longer identifiable. It’s like putting a tomato in a food processor. As clusters of gates are replaced by logical equivalents, most of the original signal names are lost—only explicit register state elements, usually flip-flops, have a chance of retaining any one-to-one connection to the original Verilog. On the other hand, this logic puree becomes a highly versatile ingredient in the SoC design kitchen. Multiple functional blocks can be stirred together and optimized to create better compound functions.

The radical de-structuring of the logic in synthesis creates great challenges in placement and routing of the cells. Even if the logic originally was expressed in a regular structured form, as many datapath functions are, that regularity is destroyed in the “puree” process. The job of placement is to discover or rediscover the optimal (x,y) topology for that logic, to meet the often-conflicting goals of area, speed, power, interface organization and block aspect ratio required in the ultimate full-chip design. Then the router needs to reconnect all those blocks within the space available—the routing channels over and between the logic cells—to make the circuit work again. It’s like trying to reconstruct the whole tomatoes again from the tomato paste. The results are imperfect, but still remarkably tasty.

As designs get bigger and bigger, the challenge gets worse. A leading-edge digital signal processor core may contain hundreds of thousands of cells, implementing more than one million basic logic gates. Designers must choose the best recipe. One recipe calls for decomposing the processor core into a dozen or more sub-units, which are pushed individually through synthesis, then placed and routed together. This method retains a degree of structure, but forgoes the benefits of optimizing the logic across the sub-unit boundaries. An alternate recipe calls for pureeing the whole core together, then relying on placement and routing to reconstruct the natural organization of the processor. This second recipe is more time consuming in the tools, but generally seems to lead to the best results.

Recently my team has been applying this second method to the latest version of our ConnX BaseBandEngine 64 DSP core. We faced a dilemma with the recipe. Even applying the most advanced synthesis, placement and routing tools resulted in a design with one small area with very high routing congestion. The density of wires reached a critical threshold where the required connections could be completed, but not with the expected timing. Some wires had to take “scenic routes” around the congested area. But what was causing the congestion? We couldn’t just look at the gates in that small region because all of the names had been lost in synthesis. We tried coloring the layout plots by major function unit, based on the retained names of their flip-flops, but the area of congestion remained a dark lump. Finally we devised a way to “taste” the lump and trace thousands of signals back to their associated flip-flops to learn why all those wires converged in this one place. We quickly identified an obscure and relatively small function unit that was defined with an excessive number of global connections. After a small tweak to this little function unit, the worst-case routing congestion looks significantly better.
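
The “tasting” step described above can be pictured as a backward walk over the netlist. What follows is only a minimal sketch under stated assumptions, not the flow we actually ran: the netlist is assumed to be a plain dictionary mapping each cell to the cells that drive it, flip-flops are assumed to be recognizable by a hypothetical "_reg" name suffix, and the function unit is assumed to be the first component of the flip-flop's hierarchical name. All names in the code are invented for illustration.

```python
from collections import Counter, deque


def trace_to_flops(cell, fanin, is_flop):
    """Walk backward through combinational logic from one cell and
    return the set of flip-flops that ultimately drive it."""
    seen = {cell}
    flops = set()
    queue = deque([cell])
    while queue:
        current = queue.popleft()
        for driver in fanin.get(current, ()):
            if driver in seen:
                continue
            seen.add(driver)
            if is_flop(driver):
                flops.add(driver)      # stop at the register boundary
            else:
                queue.append(driver)   # keep walking through anonymous gates
    return flops


def rank_units(congested_cells, fanin, is_flop, unit_of):
    """Count, per function unit, how many (cell, flip-flop) links reach
    the congested region, and return the units sorted by that count."""
    counts = Counter()
    for cell in congested_cells:
        for flop in trace_to_flops(cell, fanin, is_flop):
            counts[unit_of(flop)] += 1
    return counts.most_common()


# Toy netlist (hypothetical): gate names are meaningless after synthesis,
# but flip-flop names still carry their original hierarchy.
fanin = {
    "g1": ["g2", "lsu/addr_reg"],
    "g2": ["misc/ctl_reg", "g3"],
    "g3": ["misc/mode_reg"],
    "g4": ["misc/ctl_reg", "alu/acc_reg"],
}
is_flop = lambda name: name.endswith("_reg")
unit_of = lambda name: name.split("/")[0]

ranking = rank_units(["g1", "g4"], fanin, is_flop, unit_of)
print(ranking)
```

In this toy case the small "misc" unit shows up at the top of the ranking, which is the kind of signal that would point toward a unit with too many global connections.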

I love to cook, and processor-based SoC design sometimes poses similar puzzles. A standard recipe is a great starting point, but you also need to pay close attention to the taste and texture of what you’re creating. When you’re in the kitchen doing something new, you can invent new twists on the recipe to make things even more delectable.

—Chris Rowen is the chief technology officer at Tensilica.

Automation Is A Beautiful Thing

April 5th, 2012

By Chris Rowen
Design is fun. Design is hard. Design is important. Many of us spend hours, days, months, or years designing new chips, software, processors, systems or networks. Design is a process of understanding constraints and finding at least one feasible solution.

Sometimes the constraints are economic: “The chip, with packaging, can’t cost more than $3 to manufacture at a run rate of 100K units per month.” Sometimes, they are logistical: “We HAVE to have the design ready for manufacturing by the end of July 2012.” Sometimes the constraints are on technical parameters: “Logic clock frequency must be at least 1GHz in TSMC 28HPM technology, with power dissipation less than 500mW, including leakage.” And sometimes the constraints specify flexibility: “The design must be reprogrammable to handle the following five standards, plus any related standard that comes along in the next two years.”

The requirements often pull the design team in conflicting directions, and every real world design is full of compromises—a delicate balancing act of technical choices and subtle negotiations among champions for different parts of the whole. Experienced design teams, given the opportunity, get very good at pulling all the pieces together into an integrated design whose economy and versatility belie the compromises within.

One of the key tools of these experienced design teams is automation and streamlining of the design process. That automation may be in the standardization of the interfaces among blocks; it might be in the programming interfaces between applications and underlying services; it might be in the form of tools for “meta” design—design of the generalized solution, from which a specific solution is generated once the final constraints are known.

In recent months my team has been working on a family of high-end DSPs for wireless baseband communications and related high-throughput computation tasks. The crew recognized early on that marketing’s first set of goals and constraints was not likely to be their last, so we decided to build a DSP generator instead of a DSP. This DSP generator runs on top of tools that take processor configuration and language descriptions and spit out RTL, compilers, models, verification suites and RTOSes. The DSP generator is really building TIE descriptions from even higher-level requirements. It offers our design team a set of high-level knobs that provide an abstract view of all of the key characteristics of the DSP, including all the essentials:

• How many operations can issue in one cycle (as a Very Long Instruction Word bundle)?
• How many data elements are held in each register?
• How many computations are done in parallel in each operation?
• What operations are allocated to each operation slot?
• How many different instruction formats are supported?
• What optional DSP instruction set packages are included in the final DSP?
• What is the pipeline structure of the DSP, to balance operating frequency with core size?
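
To make the idea of high-level knobs concrete, here is a minimal sketch of how such a parameter set might be captured and sanity-checked before anything is generated. It is purely illustrative: the class, field names, package names and checks are assumptions made for this example and do not reflect the real generator or TIE syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DspKnobs:
    """Hypothetical high-level description of one DSP variant."""
    issue_width: int                  # operations issued per VLIW bundle
    elements_per_register: int        # data elements held in each register
    lanes_per_operation: int          # computations done in parallel per operation
    slot_operations: tuple            # one entry per slot: allowed operation classes
    instruction_formats: int          # number of instruction formats supported
    optional_packages: frozenset = frozenset()
    pipeline_stages: int = 5          # illustrative default, not a real figure

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        for name in ("issue_width", "elements_per_register",
                     "lanes_per_operation", "instruction_formats",
                     "pipeline_stages"):
            if getattr(self, name) < 1:
                problems.append(f"{name} must be at least 1")
        if len(self.slot_operations) != self.issue_width:
            problems.append("slot_operations needs one entry per issue slot")
        return problems

    def peak_ops_per_cycle(self):
        """Upper bound, assuming every slot issues a full-width operation."""
        return self.issue_width * self.lanes_per_operation


knobs = DspKnobs(
    issue_width=3,
    elements_per_register=16,
    lanes_per_operation=16,
    slot_operations=(("load", "store"), ("mul", "alu"), ("alu", "shift")),
    instruction_formats=2,
    optional_packages=frozenset({"fft"}),
)
print(knobs.validate(), knobs.peak_ops_per_cycle())
```

The point of the sketch is the shape of the interface: a handful of abstract parameters in, a consistency check, and only then the expensive generation step.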

Building the DSP generator, of course, is more work than building just one DSP, but it can be used for widely different end markets and with widely divergent technical constraints. The automation, however, also reminds me of something essential about the design process. At its heart this process is a team of individuals, often with strikingly different skills and personalities. No one person can do a complete design of this complexity. It takes the combined energy and the diversity of a team to pull off such a design. Design automation, done right, is not an alternative to conventional design. It’s a magnification or distillation of the team skills. And it makes design more fun.

–Chris Rowen is CTO of Tensilica.

Circling The Flat World

March 8th, 2012

By Chris Rowen
Some weeks I immerse myself in some new technical problem—ultra-low-power DSP architectures, new video processing algorithms, or multi-core programming models—but some weeks I’m on the road.

Right now, I’m halfway around the world, in the middle of a trip touching Barcelona (Spain), Pune (India) and Seoul (Korea). So much has been written about globalization that I hesitate to use the word. We obsess on how new technologies—Facebook, smartphones, HD video—have spread to every geography and into everyday life. We jump to the conclusion that this must make our lives all alike, too. The rapid pace of technology change belies the more gradual pace, but inexorable impact, of social transformation.

The spread of the Internet does carry common language, entertainment and youth style. On the other hand, it doesn’t often change our mother tongue or who we tend to marry. These tap into deeper and more conservative wells of culture. But the spread of “world culture” does sink in. It often changes our second language. It shifts our aspirations for our children. It carries us into much greater personal contact with one another across oceans, class and experience. That contact often makes our differences more vivid. The face-to-face meeting with an eager new Indian college grad shifts my perspective more than any statistics on education in the developing world. And over a generation, the contact triggered by technology can deeply influence even our most basic sense of community.

Mobile World Congress 2012 in Barcelona opened a window into a potent new round of basic technologies that can only reinforce globalization:

  1. LTE-Advanced, the next step in 4G wireless, was everywhere. LTE-A offers peak data-rates of 300Mbps downlink, and is likely to routinely deliver 40 to 50Mbps to the average smartphone user. It is still a couple of years away from real deployment, but wireless basestation providers are starting to deliver LTE-A-ready systems and UE (User Equipment, aka handset and data-card) silicon makers are starting to promise multi-standard 3G/LTE/LTE-A solutions.
  2. The shift from smartphone-as-device to smartphone-as-application-platform is nearly complete. Nokia still showed a remarkable 41 Mpixel phone as a symbol of the fading notion that a cool spec makes a compelling story, but the message of Barcelona was the blurring of the boundaries among smartphone, tablet and PC. These are distinct devices, but increasingly they are one platform. This was best shown in the Android booth—really the Android expanse—that was all about the application platform.
  3. The partner to the application platform is the user experience. More and more, that experience is highly visual. Multi-stream wireless HD video, social network-linked photography, gesture recognition, speech recognition and advanced image processing are all becoming integral to the mobile experience.

There are still unsolved technical problems here, of course. Perhaps the most pressing one is how we’re going to make all these high-bandwidth experiences in networking and multimedia into rich application programs, within the constraints of tiny batteries. These problems require at least 10x to 20x the computing horsepower of today’s best mobile applications processors, so just scaling up the clock (and the power) is out of the question. The opportunity for new classes of audio, video, imaging and wireless baseband processing looks unbounded.

The challenge of Barcelona—and Pune and Seoul—is not whether we can keep up the pace of technology development. The question is how we will all leverage it to change lives in positive ways, and use it to gain deeper insights into the differences that make us so interesting to one another.

–Chris Rowen is the chief technology officer at Tensilica.

Thinking Different

February 9th, 2012

By Chris Rowen
Last night, I turned the last page on Walter Isaacson’s biography of the late Steve Jobs. Isaacson’s masterful work does remarkably well in getting behind the myth to an honest and insightful look at the extraordinary man and his extraordinary creation—Apple. Isaacson brings out the three defining characteristics of Jobs’s approach to everything:

  1. Intensity of focus on the goal: Building the best possible product, even sometimes at the apparent expense of caring about the people who are building it
  2. Independence of thought: Looking at fundamentals, especially how to build what the user didn’t yet know that they wanted, without concession to prevailing wisdom on engineering, business models or management style
  3. Deep commitment to aesthetics: As Jobs described it, the combination of technology and the humanities in every aspect of the product experience—the physical design, the user interface, the retail experience, the access to music and video content, even the way the product was nestled in its box.

Many of us aspire to some of Jobs’s best qualities (though the world would not necessarily be a better place if we possessed all of his quirks too). His accomplishments do, however, set a challenge to all of us. How do we look at a tough problem—whether it’s building a new electronics platform, addressing rising energy costs or raising a daughter—and come up with a fundamentally better, more complete, more responsible path to success?

I’ve been thinking about this challenge in a new domain—advanced computational imaging and video analysis. This is not just a big computational problem. Many of the applications really want trillions of pixel operations per second. But it’s also a challenge in harnessing the creativity of the imaging and video community. Bright imaging algorithm experts come up with clever new methods constantly, but few of those algorithms make it into the mainstream because they are so hard to implement. The extraordinary computation requirements usually make totally hard-wired logic implementations the only deployment option.

But what if we could do trillions of operations per second in a small fraction of a watt, in the corner of a small chip? What if we could change the basic programming model for imaging so that new algorithms (for 3D gesture recognition, for facial expression interpretation, for dramatic image improvement) could just compile and run without specialized software development effort? What if we could map out a scaling to performance levels 100 to 1,000 times the performance of today’s typical mobile device processors? Now that might be worth doing!

I’ve also been looking back at my last post. I was then just getting ready to apply my model of “preparation + intensity = success” to a foot race. Well, it worked! I had perfect running conditions in Sacramento—cold and clear. I went out faster than I planned, but managed to keep the pace (eight minutes per mile) for the whole distance, finishing in 3 hours, 30 minutes, which is 10 minutes ahead of my Boston Marathon 2013 qualifying time. That’s a good run for an old guy like me!

–Chris Rowen is chief technology officer at Tensilica.

Running A Marathon

November 3rd, 2011

By Chris Rowen
Entrepreneurship has a lot in common with running marathons. It may seem like a simplistic cliché, but the analogy works at multiple levels.

The surface connection is obvious—running a marathon is a huge effort, sometimes painful, and typically a little slower than you’d like. But ultimately, it is spectacularly satisfying. Building a company from scratch is much the same – it takes huge effort to build a team, sift through all the good ideas to find great ideas, win over the first brave customers, and then scale up the business into a serious technology franchise. (And even if you’re highly successful, the effort and time are usually larger than the first business plan suggested.)

The entrepreneurship-marathon analogy works even as you dig deeper. Succeeding in a marathon requires two distinct kinds of effort. First, you need to train. That means long hours on the road simply putting in enough miles to get your body ready to run 26.2 miles in one stretch. Even if you don’t care about a specific time in the marathon, you need to get to the point of running more than 30 miles per week in the last couple of months before a marathon, with the longest runs of at least 15 miles. You need sustained commitment! Second, on the day of the race, you need to make a mental commitment to succeed. Running that distance is not comfortable or easy. It’s painful and boring. You want to stop, but you must go on. If you’re going to finish, if you’re going to set a personal record, you need intensity!

Building a company requires exactly that sort of commitment and intensity, not just at an individual level, but across the team. The sustained commitment is particularly visible in the development of the product. Everyone involved in architecting, designing, testing, documenting, releasing, maintaining and supporting the product is putting in long hours, often at the expense of other more comfortable activities with family and friends.

The critical role of intensity is particularly important in winning business. Few customers want to buy from a small, untried start-up company. They’d rather make safe and easy decisions. To get those first customer wins requires a great product and a strong technology foundation, but that’s rarely enough. It requires an obvious intensity and dedication to getting that customer’s agreement. You often see that intensity in the most successful sales efforts. In technology sales, that usually means a focused team of people (sales managers, applications engineers, factory experts and company leaders) working relentlessly to connect with the customer’s hard problem, to overcome objections, to craft novel business models and product variants, and to drive to closure and delivery. This sort of success on “race day” is built on preparation, but also intimate and intense teamwork to bring all the pieces of the solution together at the magic moment.

I only started running with any seriousness a couple of years ago. It’s my form of middle-aged renewal. Right now, I’m in the middle of training for my third marathon (California International Marathon in Sacramento, Dec. 4). I’m running up to 45 miles a week, hoping to get my system tuned to the point that I’ll be prepared to set a personal record. On the day, I’ll need all my intensity. I’m trying to break 3 hours, 40 minutes. That’s nothing compared to what elite runners do. Patrick Makau just set a new world record of 2:03:38 a few weeks ago. But it would be a good run for an old guy like me.

–Chris Rowen is the chief technology officer at Tensilica.

Squeezing the UE ’Til It Hurts

October 6th, 2011

By Chris Rowen
“UE.” Oh, what a dry and obscure term, but that phrase holds the key to the mobile, always-connected world. “UE” means “User Equipment,” that exploding class of consumer devices that communicate over wireless with the rest of the world. In particular, it means smart phones, the most global and coveted of personal communications gadgets in history. Smart phones are already shipping at a rate of more than 400 million units per year, and Gartner Dataquest projects volumes above 1.1 billion units by 2015. The world has never before seen such growth for a computing and communication product. But what is happening inside these devices to make them successful and affordable?

The state-of-the-art smart phone is a remarkable, modern platform, but success and affordability can be largely traced to some old-fashioned principles applied inside. First, the chips need to be small. That means efficient use of memory, optimization of processors and logic, and rapid adoption of the densest available mainstream silicon fabrication technology. Second, usability depends on battery life, and battery life depends on energy efficiency. (You saw that coming, right?)

Energy efficiency is not just a design target, but a pervasive mindset that changes system architecture, software, processor design, logic libraries, and transistor characteristics. This is particularly true in the constantly active subsystems within the phone such as wireless Internet downlink access. Leading-edge phones now require peak down-link rates in excess of 100Mbps (for 4G LTE Category 4) and sustained down-link rates in tens of Mbps. Driving ultra-low power at these data rates turns out to be one of those Grand Challenge problems that inspire basic technology progress.

In recent months, I’ve been intimately involved in pushing the envelope on low-power DSPs for 3G and 4G wireless baseband processing. I now see five big lessons:

  1. Work from real, production-grade baseband solution algorithms and software. Rough approximations for key DSP kernels may offer useful early design exploration hints, but it is tough to know the performance (and power) contribution of each hardware feature without detailed profiles.
  2. Partition the baseband hardware into general-purpose and special-purpose elements. Baseband processing has become so complex, especially with the near-universal requirement for multi-standard 2G/3G/4G support, that completely hardwired design is untenable (because the design would be too fragile). Similarly, making every processing element a general-purpose legacy DSP is equally untenable (because the design would be too big, slow and power-hungry). Instead, make each computing element just programmable enough to serve the range of expected algorithms it must run. (Happily, configurable processors make “just programmable enough” easy to achieve).
  3. Squeeze the DSP architecture ’til it hurts. SIMD/VLIW processor architectures are among the most efficient known. Ample vector registers reduce memory power. Good compilers ease software development from C. However, there are many ways to specify SIMD/VLIW architectures. By focusing on the most common load, store, arithmetic, shift, logical and data reordering operations, the gate complexity of SIMD/VLIW architectures can be sharply reduced. We’ve found that careful allocation of load/store interfaces and register file ports, optimization of the computation pipeline and reduction of less common operations, even while increasing the number of VLIW operation slots, gave dramatic improvements in energy per operation and total DSP size.
  4. Measure, measure, measure. As we enter uncharted waters in streamlined baseband design, each day brings hardware and software design decisions. Can we drop this operation? Is it less energy to use more registers or reload this data? Should we lengthen or shorten the execution pipeline? Initial intuition is limited. Good decisions need good data. So we build countless variations of the processors and measure performance and power of the relevant software in detailed simulation. Processor automation allows us to build, verify, program and simulate (all the way down to the gate level) in hours.
  5. The benefits of lean design compound on each other. In this latest round of baseband design, we’ve been pushing processor instruction set streamlining, new pipeline organization, new memory systems and improved low-power VLSI flows. We expected each to make a noticeable contribution, but we didn’t anticipate how the parts would interact. By making instruction sets simpler, we were able to compress the execution pipeline. This further reduced the gate count in the design, which in turn reduced the wiring design in every stage. Shorter wires helped the layout process so that we have reached higher final gate density. As a result, our first complete prototype beat our initial power targets by more than 40%. We got to reduce our power targets!

By applying all these ideas together, we’ve been able to demonstrate power reduction of more than a factor of four, for a given process technology, for the essence of 3G/4G wireless baseband. We’re on the cusp of a new generation of more programmable baseband platforms with not just better flexibility, but also smaller size and longer battery life, than ever before.

–Chris Rowen is chief technology officer at Tensilica.

Power. Power. Power.

September 8th, 2011

By Chris Rowen
You’d have to be dead not to recognize the increased attention to energy efficiency in all sorts of electronic designs. This is particularly true in wireless baseband design for advanced mobile handsets. In the march from 2G to 3G to 4G, we’re seeing the wireless computation requirements go up by about 4 orders of magnitude. Silicon technology does drive to lower power over time, but not THAT fast. So we need to look to other ways to beat the heat and save the battery. At the same time, though, the complexity of the computation is growing, particularly as phone makers want 2G AND 3G AND 4G all running on the same platform.

So how can we get there? One thread is simply making general-purpose digital signal processors more efficient. There’s ample room for improvement there, driven by DSP architectures that use registers and memories more efficiently, leverage smarter compilers for more parallelism, optimize the instruction sets for wireless data-types, and tune pipelines to reduce clock and data toggling.

But sometimes you need to go even further. Traditionally, going further in energy efficiency means building hardwired data-path logic to implement simple DSP functions like FFTs, FIR filters and matrix multipliers. That can help enormously in power—sometimes 4x or more in power compared to traditional DSPs—but it comes at real cost. Wireless baseband designers are forced to make hard choices such as the number of filter taps, the size of FFTs, the flow of data, and the sequencing of operations during the initial architecture phase, perhaps years before actual rollout in wireless systems. This leads to a mix of serious overdesign in area (and power) and nasty surprises as required algorithms evolve in real-world trials and deployment.

There is another, probably better, way: small programmable dataplane processing units tuned to those specific tasks (FFT, FIR, matrix multiply), with two key characteristics. Their computational data-paths look a lot like their hardwired data-path cousins, but these are integrated with tiny versions of the general-purpose processor. This means you can get the benefits of programming a “real processor,” such as running any C code, easy debugging and software upgrade in the field, compatible migration of binary utility functions around the system, and programmable self-test and self-configuration. And you get the power and area footprint of hardwired logic.

I recently prototyped an example of this for complex FIR filtering at rates up to 64 complex FIR taps per cycle—256 multiply-adds in parallel. (Not even Tensilica’s new BBE64 DSP can sustain this rate in a single core). As a programmable processor, it can optimally reuse data and coefficients, reducing the data fetches from RAM compared to naïve implementations. The processor part of the solution logic, which is the stuff that makes this filter programmable, is a tiny fraction (less than 10%, without memory). In a 40nm process, the extended FIR processor, including all required instruction RAM, has a total area little more than half a square millimeter and power approaching a micro-watt per multiply-add per MHz. That’s competitive with hardwired filters for the same task.
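
One plausible way to read the figure of 64 complex taps as 256 multiply-adds is that each complex tap expands into four real multiply-adds. The sketch below spells that out in plain Python. It is a reference model written for illustration only, not the prototype's instruction set, and the function names are assumptions of this example.

```python
def complex_mac(acc, x, h):
    """One complex multiply-accumulate, written as four real multiply-adds."""
    re = acc.real + x.real * h.real - x.imag * h.imag   # 2 real multiply-adds
    im = acc.imag + x.real * h.imag + x.imag * h.real   # 2 real multiply-adds
    return complex(re, im)


def complex_fir(samples, taps):
    """Direct-form complex FIR: one output per fully overlapped position."""
    out = []
    for n in range(len(taps) - 1, len(samples)):
        acc = 0j
        for k, h in enumerate(taps):
            acc = complex_mac(acc, samples[n - k], h)
        out.append(acc)
    return out


REAL_MACS_PER_COMPLEX_TAP = 4
taps_per_cycle = 64
real_macs_per_cycle = taps_per_cycle * REAL_MACS_PER_COMPLEX_TAP

samples = [1 + 2j, 3 - 1j, -2 + 0.5j]
taps = [0.5 + 1j, -1 + 2j]
print(complex_fir(samples, taps), real_macs_per_cycle)
```

Each output reuses every coefficient and all but one of the previous output's samples, which is the data and coefficient reuse a programmable engine can exploit to cut RAM fetches.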

This is clearly where power-efficient handset design is going. It gives you most of the benefits of heavily optimized data-path design for the simple signal processing functions that dominate energy consumption, plus most of the programmability benefits of the bigger, more general-purpose DSPs that run the other even more complex functions. This is just another example of the 90/10 rule—90% of the power goes into 10% of the DSP functions, so you need to do an especially good job on those functions!

–Chris Rowen is the chief technology officer of Tensilica.

Start Your Engines

August 11th, 2011

By Chris Rowen
Leadership in microprocessor architectures evolves over decades, and the intellectual battles for leadership provide sustained enlightenment and entertainment for programmers and engineers of every stripe. All this comes together in the big technical conferences on processors.

Leadership in processor conferences also has evolved over the years. These days, the Hot Chips Conference has risen to the top of the heap. Hot Chips 23 is coming up Aug. 17-19 at Stanford University and it looks to provide both entertainment and enlightenment.

The entertainment comes from the now-traditional battle of the advanced mainstream processors—x86 versus ARM, and within the x86 world, Intel vs. AMD, with talks on Intel’s second-generation Core microarchitecture, AMD’s Llano APU and Bulldozer cores, and ARM’s high-performance mobile CPU roadmap. And of course, the other architectures—IBM, Itanium and SPARC—are sweating to carve out niches of sustainable relevance around the mainstream microprocessors.

There’s also some cool technology with high entertainment value, including a whole Intel paper on generating better random numbers, Microsoft talking about the inner workings of the Kinect gesture recognition system, and new networking interfaces that push transaction rates into the billions per second. And for those who like a good fight, there’s a panel on “The Ecosystem Wars,” with bitter rivals in both processor architecture and operating systems facing off.

The best enlightenment will come from the remarkable range of talks on higher levels of silicon integration, successful scaling to many-core platforms and application-oriented processors. Talks on multi-core network processors (Cavium), multi-core security (Tilera), breaking the communications bottleneck in large-scale systems (UC Berkeley), and many-core data center servers (SeaMicro) all highlight the importance of the parallelism problem. The Wednesday seminar on package-scale power management also goes to the heart of a key issue.
And in the enlightenment category, I get to mention my big talk on Tensilica’s latest baseband engine, “The World’s Fastest DSP Core: Breaking 100 GMAC/s barrier.”

The best part of this sort of conference, though, is the hallway interaction. The combination of camaraderie and intellectual competition is a compelling mix. It’s sure to be a good show.

–Chris Rowen is chief technology officer at Tensilica.

The Next DSPs

July 21st, 2011

By Chris Rowen
One of the great aspects of being a CTO is that I get to work on such a juicy range of interesting problems. Sometimes it’s strategy for penetrating new markets. Sometimes it’s building long-term technology roadmaps. And sometimes it’s getting my hands very dirty working on intense and practical product innovations.

In the last year I’ve spent a lot of time working on the architecture of our next generation configurable DSP family. The requirements for a new DSP reveal a fascinating set of contrasting needs, especially given our drive to push DSP performance by up to 10x as well as improve operations/watt by up to 4x in a single generation.

Traditionally, DSP architects face daunting conflicts among goals. DSPs arose in the first place because they could be more focused—and hence, more efficient—on arithmetic computations than CPUs and microcontrollers. But how specialized should a DSP be? TI’s successful C6x DSP core family uses VLIW instruction set ideas to execute a set of up to eight more-or-less independent and generic operations per cycle. This leads to decent performance across a wide range of tasks, but perhaps not such outstanding performance-per-mm² or performance-per-watt as more narrowly specialized cores. At the opposite extreme, you find highly specialized data-path engines built for a single task like FFT computation or FIR chains. They are programmable only in the sense that software running on some processor can set parameters (like coefficients) and initiate operations. These two styles can differ in energy and area efficiency by more than an order of magnitude.

Processor configurability has now entered the mainstream for architects of complex chips, especially for data-intensive applications in wireless communications, imaging, multimedia processing, security, storage, networking—in fact, anywhere that cost and power demands intersect with growing data rates and algorithm complexity. It turns out that adding the instructions and interfaces needed by the applications’ compute and communications patterns, and removing the instructions and interfaces that aren’t needed, can improve efficiency by a factor of 10 or more. So making a state-of-the-art DSP core configurable is both easily possible and increasingly necessary.

What other requirements did we need to consider? For the broadening DSP family we’ve wrestled with the following needs:

  1. Make it extremely efficient in the target domain—4G wireless baseband and related high-throughput communications tasks up through LTE-Advanced, the 1 Gbps communications standard that will become universal in the second half of this decade.
  2. Make it scale to extraordinary performance levels: 100 GMAC/s per core and 1 TMAC/s per chip should be readily available to system developers.
  3. Make it fully programmable at peak performance from C code, including automatic vectorization of ANSI C code and of C code that uses scalar and vector DSP types with standard C operators. Make assembly code development completely obsolete by enabling C-based development at the same performance in all circumstances.
  4. Include specialized instruction set features for the most demanding specialized DSP functions commonly found in advanced wireless, including high data-rate parallel FFT, DFT, FIR and matrix-multiply operations.
  5. Make it small and well-suited to current and next-generation process geometries at 40nm and below. Ensure that VLSI structures that are becoming relatively more expensive over time, such as memory connections and global on-chip wiring, are used sparingly and intelligently so DSP core efficiency improves with each successive process node.
  6. Give a wide range of point-and-click configuration options so that system-on-chip designers can tailor a unique set of instruction set features, interfaces, and memory systems to their exact needs. One processor generator should be able to create cores tuned to fit dozens of different computation rates, data types and communications styles.
  7. Allow architects to simply describe entirely new private operations, instruction formats and programmer state and have these fully incorporated into the hardware, compiler, debugger, RTOS and multi-core communications environment.
  8. Deliver complete DSP libraries, integrated code development environments, popular debuggers, fast, cycle-accurate and pin-accurate models, and FPGA prototypes.
  9. Make code developed on current members of the family fully compatible to preserve the large investment in code customers and partners have already made.
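To make requirement 3 a little more concrete, here is a minimal sketch of the kind of plain C loop that a vectorizing compiler would be expected to map onto wide multiply-accumulate hardware. Everything in it is an illustrative assumption rather than anything from a Tensilica library or toolchain: the function names `fir_int16` and `fir_mac_count`, the 16-bit sample type, and the 64-bit accumulator are all hypothetical choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar FIR kernel in portable C99:
 *     y[n] = sum over k of h[k] * x[n + k]
 * The caller must supply n_out + n_taps - 1 input samples in x.
 * The restrict qualifiers promise the compiler that the three arrays do
 * not overlap, which is the information an auto-vectorizer needs in
 * order to reorder and batch the multiply-accumulates. */
void fir_int16(const int16_t *restrict x, const int16_t *restrict h,
               int64_t *restrict y, size_t n_out, size_t n_taps)
{
    assert(x != NULL && h != NULL && y != NULL);

    for (size_t n = 0; n < n_out; n++) {
        int64_t acc = 0;  /* wide accumulator so long filters cannot overflow */
        for (size_t k = 0; k < n_taps; k++) {
            acc += (int64_t)x[n + k] * (int64_t)h[k];  /* one MAC */
        }
        y[n] = acc;
    }
}

/* Number of multiply-accumulate operations the kernel above performs. */
uint64_t fir_mac_count(size_t n_out, size_t n_taps)
{
    return (uint64_t)n_out * (uint64_t)n_taps;
}
```

As a rough sense of scale, a 64-tap filter over 1,000 output samples is 64,000 MACs. At the 100 GMAC/s per-core target in requirement 2, that is about 0.64 microseconds of pure arithmetic, ignoring memory traffic and loop overhead.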

This next DSP is a complex undertaking as we balance instruction encoding density, size, power, compiler efficiency and configuration flexibility. To do all this, we’ve had to put together new tools, some just for internal use. For example, we’ve built new mechanisms to track the evolving instruction set architecture. For each proposed decision on instruction operands, we get instant feedback on name conflicts, register file port utilization, VLIW formats, instruction encoding space and configuration package usage. We’ve developed an abstract C programming style for highly tuned code that allows one DSP kernel library or verification suite to cover the full range of possible core sizes and configurations. We’ve leveraged our large body of DSP function kernels, examples, and applications to get realistic feedback on the ability of the compilers to exploit the novel instruction features of the new generation.

The process of designing the next DSP core has done two things. First, of course, it has created a potentially important new engine for the wireless world. More importantly, perhaps, it has led to a quantum step in the automation of core architecture, tuning and verification. That will bring future cores, even those in areas far removed from wireless, to market sooner, with higher energy efficiency, completeness and raw performance.

–Chris Rowen is the chief technology officer at Tensilica.

Limitations Of General-Purpose Processors

June 16th, 2011

By Chris Rowen
There’s been a lot of discussion about finding the right mix of speed, power and performance in SoC cores. By far the best approach is tailoring the processor core to the exact task, not tailoring the architecture to fit the core.

General-purpose processors are simply not fast or efficient enough to do the hard work embedded in 4G smart-phone basebands and DTV media processing SoCs. General-purpose processor architectures are not fast enough because they implement only generic operations (typically primitive arithmetic, logical, shift, and comparison operations on 8-, 16-, and 32-bit integers) and because they perform little more than one basic operation per cycle. This moderate performance is perfectly adequate for typical user interface and control applications, low-resolution image manipulation and low-bandwidth signal processing.

The most demanding mobile and living-room signal, image, protocol, security, and other data-intensive tasks require tens or hundreds of operations per cycle and power levels in just tens of milliwatts. This takes general-purpose processors out of the picture. Today’s digital cameras, cell phones, DSL and cable modems, high-performance routers, and digital televisions all use special-purpose processors or hard-wired logic circuits for their most demanding data functions.

There are four key issues that need to be addressed in the SoC dataplane.

1. Data throughput. Using bus interfaces to transfer data is common practice, but it’s also slow. A better option is to bypass the main bus entirely, directly flowing data into and out of the execution units of the processor using a FIFO-like process, just like a block of RTL.
2. Fitting into a hardware design flow. Using existing tools, can designers simulate the processor in the context of the entire chip? If not, what’s the penalty?
3. Processing speed. Can the performance be optimized for an application, such as video, audio or communications? That can result in orders of magnitude difference.
4. Safe hardware and software. Can the core be optimized without penalty? This is critical because most designers are not processor experts, which makes them hesitant to customize a processor architecture for their needs.
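The FIFO-like data flow in the first item can be pictured with a small software model. This is only a sketch under stated assumptions: the type `fifo_t`, the functions `fifo_init`, `fifo_push` and `fifo_pop`, the 32-bit word width and the depth of eight entries are all hypothetical, and a real queue interface would be a hardware port on the processor rather than a C structure. The point it illustrates is the handshake: a producer stalls when the queue is full, a consumer stalls when it is empty, and no bus transaction is involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 8u  /* hypothetical queue depth, chosen for illustration */

/* Software model of a point-to-point queue between two blocks. */
typedef struct {
    uint32_t data[FIFO_DEPTH];
    size_t head;   /* index of the next word to read */
    size_t tail;   /* index of the next free slot to write */
    size_t count;  /* number of words currently queued */
} fifo_t;

void fifo_init(fifo_t *f)
{
    assert(f != NULL);
    f->head = 0;
    f->tail = 0;
    f->count = 0;
}

/* Returns false when the queue is full, which models a stalled producer. */
bool fifo_push(fifo_t *f, uint32_t word)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH) {
        return false;
    }
    f->data[f->tail] = word;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return true;
}

/* Returns false when the queue is empty, which models a stalled consumer. */
bool fifo_pop(fifo_t *f, uint32_t *word)
{
    assert(f != NULL && word != NULL);
    if (f->count == 0) {
        return false;
    }
    *word = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

In this model the execution unit on the consuming side simply calls `fifo_pop` whenever it needs the next operand, which is the software analogue of data flowing directly into the datapath the way it would into a block of RTL.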

Processors need to be fully programmable, but they also need hardwired logic functions to do the “heavy lifting” in the SoC dataplane. And for every processor that’s used as a traditional controller, there need to be many more standing behind it to do the hard data processing work.

–Chris Rowen is CTO at Tensilica.