Posts Tagged ‘Arteris’


Rethinking SoC Architectures

Thursday, May 24th, 2012

By Ed Sperling and John Blyler
Virtualization and coherency, two concepts that can trace their origins back several decades, are suddenly gaining attention these days—but for entirely different reasons and uses.

A good way to think about virtualization is as an opportunistic use of available resources. Rather than waiting in a queue for a single processor core in a multicore SoC, for example, virtualization allows a compute task to take advantage of whatever processor is available if another is in use.
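To make the queueing argument concrete, here is a minimal Python sketch that compares running a batch of tasks on one fixed core with dispatching each task to whichever of several cores frees up first. The task durations and core count are hypothetical. Real hypervisors and OS schedulers also weigh affinity, cache warmth and priority, not just availability.

```python
import heapq

def pinned_makespan(durations):
    """Every task waits in line for one fixed core, so they run back to back."""
    return sum(durations)

def opportunistic_makespan(durations, num_cores):
    """Every task runs on whichever core becomes free first."""
    free_at = [0] * num_cores  # min-heap of the times at which each core frees up
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available core
        heapq.heappush(free_at, start + duration)
    return max(free_at)

tasks = [4, 2, 3, 1, 2, 4]  # hypothetical task durations, arbitrary time units
print(pinned_makespan(tasks), opportunistic_makespan(tasks, num_cores=4))
```

With these numbers the single-core queue finishes after 16 time units, while dispatching to any free core of four finishes after 6.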

The concept is hardly a new one. Virtualization was invented by IBM in the late 1960s as a way of running batch processing while still doing other work. But spreading work across whatever processors are available also makes it harder to keep their caches in sync, which is the problem cache coherency was created to solve. And as computing has shifted from multiple processors within a single machine, or across multiple machines, to multiple cores on a single SoC, cache coherency has moved from the mainframe to the PC to the processor, and now across multiple processor cores.
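For readers who want to see what keeping caches in sync means mechanically, the sketch below models a textbook write-invalidate protocol with three line states (Modified, Shared, Invalid), usually called MSI. It is a toy under stated assumptions: one value per cache line, a single snooping bus shared by all caches, and no timing. The class names are hypothetical, and commercial coherency protocols add more states and considerably more machinery.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"

class Bus:
    """Toy snooping bus shared by every cache and one backing memory."""

    def __init__(self, memory):
        self.memory = memory  # address -> value
        self.caches = []

    def read_miss(self, requester, addr):
        # A cache holding a dirty (M) copy writes it back and drops to S.
        for cache in self.caches:
            if cache is not requester and cache.state(addr) is State.MODIFIED:
                value = cache.lines[addr][1]
                self.memory[addr] = value
                cache.lines[addr] = (State.SHARED, value)
        return self.memory.get(addr, 0)

    def invalidate_others(self, requester, addr):
        # Write-invalidate: all other copies are dropped before a write.
        for cache in self.caches:
            if cache is not requester and addr in cache.lines:
                state, value = cache.lines[addr]
                if state is State.MODIFIED:
                    self.memory[addr] = value  # write back dirty data first
                cache.lines[addr] = (State.INVALID, value)

class Cache:
    """Per-core cache holding (state, value) pairs keyed by address."""

    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.caches.append(self)

    def state(self, addr):
        return self.lines.get(addr, (State.INVALID, None))[0]

    def read(self, addr):
        if self.state(addr) is State.INVALID:
            self.lines[addr] = (State.SHARED, self.bus.read_miss(self, addr))
        return self.lines[addr][1]

    def write(self, addr, value):
        # Only an M line can be written without notifying the other caches.
        if self.state(addr) is not State.MODIFIED:
            self.bus.invalidate_others(self, addr)
        self.lines[addr] = (State.MODIFIED, value)

memory = {0x40: 1}
bus = Bus(memory)
core0, core1 = Cache(bus), Cache(bus)
core0.read(0x40)
core1.read(0x40)         # both cores now hold the line in S
core0.write(0x40, 7)     # core1's copy is invalidated; core0 holds it in M
print(core1.state(0x40))
print(core1.read(0x40))  # core0 writes back its dirty copy and drops to S
```

The key property is that a core never reads a stale value: the write by core0 invalidates core1's copy, and core1's next read forces the dirty data back to memory first.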

What’s changing now is that these concepts are spreading well beyond the processors. Virtualization is being applied to memory, storage, I/O and graphics processing units (GPUs). But to make all of that work efficiently, coherency will have to grow well beyond the cache, and that may prove to be a very difficult problem, particularly in a multivendor ecosystem.

The starting point
Much of this shift has come into focus inside large data centers over the past decade as a way of reducing costs. In the 1990s, the availability of inexpensive blade servers and ever-increasing density made it possible to begin replacing expensive mainframes and minicomputers with off-the-shelf commodity machines. You could stuff them into a single cabinet and blast in chilled air to cool them sufficiently. Two things changed this equation. One is that electricity costs rose sharply, because these cabinets drew more power and ran hotter than ever before as density and leakage current increased. The second is that data centers had bought so many servers over a period of 15 years that the cost of simply keeping them running was beginning to show up as seven-figure annual expenses for many large data centers.

Virtualization proved an effective way of reducing that cost because it allowed data centers to increase average server utilization from 5% to 15% all the way up to 85% or more. That meant fewer servers overall, less electricity, less heat to remove, and far more available real estate. But with that problem now under control, at least for the moment, data center managers have shifted their attention to the exponential rise in the amount of data being stored. In the 1990s, most of the data was simply text or code. It is now a combination of text, video, data and voice, raising the same kinds of fiscal red flags about powering and cooling storage that servers raised before virtualization.
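The consolidation math behind those utilization figures is simple to sketch. Assuming load splits perfectly across machines and ignoring headroom for peaks, the number of servers needed scales with the ratio of old to new utilization. The 1,000-server fleet below is a hypothetical example, not a figure from the article.

```python
def servers_needed(current_servers, current_util_pct, target_util_pct):
    """Servers needed to carry the same total load at a higher average utilization.

    Assumes load splits perfectly across machines and ignores peak headroom.
    """
    # Total load, in units of "percent of one server kept busy".
    total_load = current_servers * current_util_pct
    # Integer ceiling division: a fractional server still means buying a whole one.
    return -(-total_load // target_util_pct)

for util in (5, 10, 15):
    print(f"{util}% -> {servers_needed(1000, util, 85)} servers at 85%")
```

At 10% average utilization, for instance, 1,000 servers carry the equivalent of 100 fully busy machines, which 118 servers can handle at 85%.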

“What we’re seeing now is a move toward virtualized storage,” said Bob Pierce, flash business development group director at Cadence. “The next step is to merge storage and memory, which is why we’re seeing such strong interest in PCI Express. It’s a great transport vehicle. We’re also going to see back-end storage mixing with front-end storage data. What’s in between will be cache in the form of virtualized memory.”

This more fluid boundary between storage and cache has ramifications at all levels of design. It can affect everything from a single processor to multiple processor cores on an SoC, multiple chips in a stacked die, and multiple systems in a grid or mesh network.

“What’s happening is that you’re moving the back end closer to the processor,” said Pierce. “It changes the way big data and databases will be addressed in the future. If you have four CPUs, you can take them and, using PCI Express, prioritize them into a given drive sector and share them. That’s where all the VC startup money is these days. It’s the ability to configure servers for the function necessary at any given time. But you also have to virtualize storage and memory, and it has to be done dynamically.”

PCI Express has the added advantage of providing a single protocol for keeping all of this data coherent. Storing and retrieving data quickly is useful, but every copy of that data also has to be updated to reflect changes made in any part of the system.

Adding other resources
Mixing storage and cache is fairly obvious, though. Less obvious is the mixing of processing between CPUs and GPUs.

“In the past, GPUs were directly assigned to a virtual machine (VM),” said Sumit Gupta, senior director of Tesla GPU Computing at Nvidia. “Every VM would get a full GPU, which meant that each server was limited by the number of GPUs it could hold.”

Nvidia’s new Kepler GPU architecture uses more cores per streaming multiprocessor than its predecessors (192 vs. 32) and runs them at a significantly lower frequency, .175GHz compared with the old 1.35GHz. The result is faster processing with less power.

“We invented several technologies in order to virtualize the GPU, including an improved Memory Management Unit in the GPU,” he noted. “This is key because most of the data acted upon by the GPU comes from memory.”

But memory is being virtualized as well, making this whole scheme even more complicated. Startup Memoir Systems touts its approach, which it calls algorithmic memory, as an intelligent virtualization scheme for almost any available memory in a system. And there are moves afoot to do the same for a system’s multiple I/O feeds, improving the speed of uploads and downloads.

Making it all work together
While virtualization makes sense from a performance standpoint, complex systems aren’t just about performance. Coherency is a critical piece, and an extremely difficult one.

“The reality is that I/O coherency has been around for a long time in the x86 world,” said Laurent Moll, CTO at Arteris (and formerly a systems architect at both Nvidia and Broadcom). “The next frontier is when you start adding in other devices, and there’s a big disruption when you’re adding full coherency between the CPU and other things. It’s easiest when you have a small team designing the cache and all the protocols. When you start plugging multiple things together it gets a lot harder. You need to be a lot clearer about the specification, the verification and the tests that need to be run.”

He said there are two key challenges in this scheme. One is simply getting it right, which is difficult for multiple companies using different teams and with different cultures. “It’s very easy to have corner cases that the guys who wrote the spec didn’t think about,” he said. The second challenge is that there is no known path for doing this. “Quite simply, it has never been done before.”

Conclusion
The upside of getting this right is a huge boost in performance. Being able to utilize more resources at any time can improve speed on almost every part of a chip or system, and virtualization plus coherency is a big win for the user.

The downside is that, assuming this can be done in the first place, it also could have an impact on power. The whole goal of most advanced SoC designs is to keep the majority of silicon dark except when it’s needed, and even then to run at maximum performance for a very short time to get everything done quickly. Having more resources to manage on an ad hoc basis solves the use model issue for performance, but it can wreak havoc on power management schemes.

In addition, it may require new software to even work in the first place. Cadence’s Pierce said some of this won’t even make sense on platforms such as Android until the multithreaded OS release called Ice Cream Sandwich becomes more prevalent.

NoC Power Benefits

Thursday, May 24th, 2012

The system-on-chip (SoC) interconnect spans the entire floorplan of a chip and consumes a significant portion of its power. The interconnects of today’s SoCs are a distributed architecture of switches, buffers, firewalls, register slices, and clock and power domain crossings. One approach is to implement these units modularly, with a simple, universal transport protocol between all of them. This approach enables unit-level clock gating, eliminating clock tree switching power when no traffic is present. Modularity also localizes logic, which minimizes long wires and further limits power consumption by keeping capacitance low. The simplicity of the protocol also allows each function to be performed with minimal logic overhead, minimizing area and leakage power. This design approach is worth considering for power-sensitive SoCs.
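The clock-gating argument rests on the standard CMOS switching-power relation, P = αCV²f. A clock net toggles every cycle, so while it runs its activity factor α is effectively 1; unit-level gating stops it whenever the unit carries no traffic, and shorter, localized wires lower C. The sketch below applies that relation to one interconnect unit's clock tree. The capacitance, voltage, frequency and 30% traffic figure are all hypothetical, and the model ignores gating-cell overhead and wake-up latency.

```python
def dynamic_power(alpha, capacitance_f, vdd_v, freq_hz):
    """Standard CMOS switching power, P = alpha * C * V^2 * f, in watts."""
    return alpha * capacitance_f * vdd_v ** 2 * freq_hz

def clock_tree_power(active_fraction, capacitance_f, vdd_v, freq_hz, gated):
    """Average clock-tree power for one interconnect unit (hypothetical model)."""
    # While running, the clock net toggles every cycle, so alpha = 1.
    running = dynamic_power(1.0, capacitance_f, vdd_v, freq_hz)
    # With unit-level gating the clock only runs while traffic is present.
    return running * active_fraction if gated else running

# Hypothetical unit: 20 pF of clock capacitance, 0.9 V, 500 MHz, busy 30% of the time.
ungated = clock_tree_power(0.30, 20e-12, 0.9, 500e6, gated=False)
gated = clock_tree_power(0.30, 20e-12, 0.9, 500e6, gated=True)
print(f"ungated: {ungated * 1e3:.2f} mW, gated: {gated * 1e3:.2f} mW")
```

Under these assumptions the gated clock tree draws power only in proportion to the time the unit is busy, which is the effect the modular approach is counting on.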

This technical paper explains why interconnects based on modular network-on-chip (NoC) technology consume less power than older bus- and crossbar-based interconnects.

Coherency’s Next Frontiers

Wednesday, May 23rd, 2012

Laurent Moll, CTO of Arteris, talks about new types of coherency and why it will be such a big challenge.


Experts At The Table: Hardware-Software Co-Design

Friday, May 11th, 2012

By Ed Sperling
System-Level Design sat down to discuss hardware-software co-design with Frank Schirrmeister, group marketing director for Cadence’s System and Software Realization Group; Shabtay Matalon, ESL market development manager at Mentor Graphics; Kurt Shuler, vice president of marketing at Arteris; Narendra Konda, director of hardware engineering at Nvidia; and Jack Greenbaum, director of engineering for advanced products at Green Hills Software. What follows are excerpts of that conversation.

SLD: How much does the business side enter into this equation?
Greenbaum: The cost of the models, balanced against the benefits you get from having them, and the lack of well-defined interfaces conspire to make people just build the chip and do the best they can. Think about the effort to build drivers for the actual silicon. If you’re going to push real workloads through and connect to the actual Facebook servers, for example, you need a driver that can talk to the model. Did you define that interface between the CPU execution model and the approximately timed video codecs sitting on the other side of a bus in such a way that you can still write a driver quickly enough to derive a benefit from that? Will it result in fewer chip spins to get a design right? It’s a challenge just to bring a model together. Then you have to make it economically useful.
Shuler: It is starting to get better. There is more fixed-cost work to do this up front, but when you’re creating one of these chips you need to develop a platform because you have to amortize the cost over multiple chips. Companies build these chips and then make four or five derivatives that may go all over the world. For them, the cost of putting together the infrastructure up front is paid for in the end because they can get multiple chips predictably from that one. The whole SoC trend is helping the adoption of software-hardware co-design.

SLD: No matter how close we bring the hardware and software, they’re still out of sync. On the front end you’re working off something started by a hardware team, and on the back end you’re working off something that hasn’t been finished by the software team. How does that affect everything?
Schirrmeister: Things are out of sync. The question is whether a driver has been developed to push traffic to a server via the model rather than the actual hardware. I’m convinced we’ll get there, and we’ll know we’re there when model generation becomes a natural byproduct of the mainstream design flow. So a company like Arteris creates the fabric and models for different needs. Then it’s up to the user to decide which one to use. Maybe they only need the LT (loosely timed) model because they don’t need to plug in a more accurate model that will slow things down. But sometimes you don’t know what you don’t know. Your requirement may mean you’re demanding more memory bandwidth than the hardware can handle. You might have seen that if you ran it against an AT (approximately timed) model, and you certainly would have seen it if you ran it against an RTL model. It all goes back to having the models available. Sometimes it isn’t commercially feasible, but it is getting better. We are going in the right direction.
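Schirrmeister’s point about bandwidth is easy to caricature in a few lines. In the sketch below, a loosely timed model charges every memory access a fixed latency, while a simplified approximately timed model also serializes transfers on a channel with finite bandwidth, so demand above capacity shows up as queueing delay. This is a Python illustration of the difference in intent, not the SystemC TLM-2.0 API, and the function names, request pattern, latency and bandwidth figures are hypothetical.

```python
def lt_finish_time(requests, latency_ns):
    """Loosely timed: each access costs a fixed latency with no contention,
    so this model cannot reveal a bandwidth shortfall."""
    return max(issue_time + latency_ns for issue_time, _ in requests)

def at_finish_time(requests, latency_ns, bytes_per_ns):
    """Approximately timed (simplified): transfers serialize on a channel with
    finite bandwidth, so demand above capacity shows up as queueing delay."""
    channel_free = 0.0
    finish = 0.0
    for issue_time, size_bytes in sorted(requests):
        start = max(issue_time, channel_free)  # wait if the channel is busy
        channel_free = start + size_bytes / bytes_per_ns
        finish = max(finish, channel_free + latency_ns)
    return finish

# Hypothetical traffic: 64 requests of 64 bytes, one every 2 ns (32 bytes/ns demand).
requests = [(2 * i, 64) for i in range(64)]
print("LT:", lt_finish_time(requests, latency_ns=50))
print("AT:", at_finish_time(requests, latency_ns=50, bytes_per_ns=16))
```

Here the traffic demands 32 bytes/ns from a 16 bytes/ns channel. The LT model reports the last access completing at 176 ns, while the AT model reports 306 ns, exposing the shortfall that the faster model hides.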
Matalon: I’m more optimistic on that. For model creation, we need companies to provide models at all levels of abstraction. People are working in isolation. Some of this is a natural consequence of teams working in India, China and North America. But sometimes they need to see the solution. The solution is enabled by providers of models, but you cannot see how, for example, a network on chip will be utilized unless it’s put in the context of an overall platform. To do that, we need models for everything, we need automation to create the AT models, and we need analysis tools for power and performance, and all of this is something the EDA vendors can provide. There will be some groups that hold back on moving forward, but there are solutions there that are not coming from a single vendor. They’re coming from the collaboration of IP providers, tool providers, semiconductor companies and embedded software companies. Now we need to get users to accept them and to break down the barriers between hardware engineers, architects and software teams.

SLD: Are the models being updated by everyone and in all places?
Schirrmeister: That goes back to making them a natural byproduct of the design flow. If they’re not, there is no way of synchronizing everything.
Shuler: You can’t do it manually.
Schirrmeister: In the past we had models for a certain platform, but in the next revision do you really go back and update the model to be in sync with it? Only when it’s absolutely necessary. But I do think it is becoming a more natural byproduct of the flow even though we’re not there yet.
Greenbaum: Yes, it is getting better. There are a number of open-source platforms for which you can download a QEMU (quick emulator) model. But it’s still the minority, and the opportunities there are being squandered. If you look at a virtual prototype as just a development vehicle that is there pre-silicon and you get no more value out of it than the silicon, then you’re squandering the value of a software development virtual platform, which is visibility. With a virtual platform you don’t have to pay the cost to pin out the ETM (Embedded Trace Macrocell) on your ARM core to get trace.
Shuler: The same kinds of problems the hardware guys discuss the software guys are discussing. The way I look at it is that hardware is no different from software. It just has to become fixed at a certain time. RTL is software. If you’re a chip vendor you’re in software development. It’s parallel code, but half of it becomes fixed.
Greenbaum: As long as you have timing closure you’re correct.
Konda: It was the case that the hardware and software teams were doing their own things. But what we see now, because of the market pressure and the increasing software content, is that the software team is taking a much more proactive approach. If we are designing a GPU and a CPU, the hardware team and the architecture team will be developing the C model and the functional model of this. But now the software team is knocking on the door and saying, ‘Hey, give us that model. We want to run our software code on it.’ We are seeing a lot more collaboration.
Schirrmeister: The reason that’s happening is that someone who owns both teams says that if they don’t do it, software will be too late.

SLD: But what happens if the models get out of sync?
Konda: If you look at an SD (secure digital) card, the specs are changing from 3.0 to 4.0. The focus of the hardware team (two or three engineers, one verification guy and two RTL guys) is to write the spec, look at RTL, verify it’s working and it’s done. They don’t care about software, system integration or anything else. But now you’re trying to realize an SoC and you need an SD 4.0 model, so you’re searching the globe to find models for these interfaces.
Greenbaum: And they still don’t match your RTL because the DMA (direct memory access) is not a standard part of that SD interface and you lose. Even for standard interfaces there are no models.
Konda: So yes, models do get out of sync. The C model probably doesn’t agree with RTL we are developing. The other problem is trying to find a valid model from somewhere. The audio guy with two engineers doesn’t have time to develop a fast model. It’s not his job description.
Greenbaum: And we haven’t reached the point where we can generate those models, either.
Matalon: In the past, the hardware engineers owned power and performance. Today, performance and power are controlled by software. The hardware can put in hooks to control voltage and frequency scaling, but the hardware engineers have no clue when that will happen. The first big change is that co-design will be a bigger interest for the software guys because the top manager will start blaming the software team if a smartphone runs out of battery in four hours. It’s because the software guys have not used all the resources correctly. We’re seeing a shift of responsibility from the hardware guys to the software guys who have the overall system view. A second change involves the interoperability of models. You can no longer build a complex SoC based on proprietary models that are created ad hoc. They need to be TLM 2.0, they need to be standard, and they need to be re-usable in the next project. If you use standard models and make this investment, the payoff is huge. If you don’t do it, you’re stuck. A lot of software and hardware and architecture teams are aware of that. It’s happening.
Shuler: The dirty little secret is that hardware companies put all of this capability into a chip, but when they create the boards and packages they use a fraction of it. The end device manufacturers use even less. They’re using 10% to 20% of the capability.
Matalon: That’s why you can reduce 80% to 90% of your power if you evaluate power in the early stages of the architecture.

The Week In Review: May 11

Friday, May 11th, 2012

By Ed Sperling
Synopsys continued on its acquisition path, this time buying RSoft Design Group, which makes photonics design and simulation software. Synopsys has been pushing steadily into the optics design market, beginning two years ago with the acquisition of Optical Research Associates.

Cadence won a deal with Fujitsu Semiconductor, which is using Cadence’s Chip Planning System to build microcontrollers. Fujitsu ranks seventh in the world in the MCU business, with 5.5% of the market, according to Databeans. The company was No. 3 in 2010, so apparently it’s time for some serious retooling.

Arteris won a deal with IC-Logic, which licensed its network on chip and interconnect IP for automotive infotainment SoCs. IC-Logic is based in Sulzbach, Germany.

Tensilica teamed up with VWorks to provide virtual prototyping platforms, especially for multi-core designs. VWorks does advanced simulation and modeling.

TSMC sales, which are something of a bellwether for chip activity, were up 9.3% in April compared with March, and 10.4% year over year. Things seem to be picking up. http://www.tsmc.com/tsmcdotcom/PRListingNewsAction.do?language=E

Experts At The Table: Hardware-Software Co-Design

Friday, May 4th, 2012

By Ed Sperling
System-Level Design sat down to discuss hardware-software co-design with Frank Schirrmeister, group marketing director for Cadence’s System and Software Realization Group; Shabtay Matalon, ESL market development manager at Mentor Graphics; Kurt Shuler, vice president of marketing at Arteris; Narendra Konda, director of hardware engineering at Nvidia; and Jack Greenbaum, director of engineering for advanced products at Green Hills Software. What follows are excerpts of that conversation.

SLD: How often is co-design really warranted?
Greenbaum: The case for power and performance may be more common than you think. Look at the products from Nvidia: is a Tegra processor a standard product or an SoC? You put a Tegra in an automobile running an IVI (in-vehicle infotainment) system and maybe an instrument cluster. That instrument cluster gets prototyped on a desktop. Then someone gets the idea of taking that high-polygon-count design and putting it in a car. Now you have a cost you have to optimize. If I can’t do it with today’s Tegra, can I do it with the next one? You’re not going to know. That’s one advantage of co-design. You can know before the sand gets melted. Even if your application can be implemented in software, co-design that lets you know what price point you can hit is very valuable.
Konda: We are delivering a solution today that is not just silicon. That was the case 10 years ago. But today it’s a complex piece of silicon and highly complex software, and these two things have to come together. First and foremost, we have to make sure the design is in a semi-working condition before bringing in the software team to develop their code. To do that, we cannot wait for emulation or an FPGA. A design with multiple cores and multiple processors in the SoC (20 to 25 processors, components and interfaces) is a very complex device. RTL is not available all the time. Some parts of the design are in RTL, some parts are in C models. As soon as we have a little bit of confidence we want to encourage the software team to come in. That’s very early in the design cycle.
Shuler: Why do you wait so long? Why don’t you do it as soon as you get the initial requirements?
Konda: We have been doing that for a number of years, but it’s still not a full-fledged solution. In our case, we have 30% to 40% of the design modeled at the very beginning and software teams are already working on that model. But it is not the entire SoC. They are working on the GPU or CPU portion of the SoC. How do we model all of these peripherals? That’s not there yet.

SLD: There are two trends unfolding in IC design. One involves a general-purpose processor, where you may leverage one or more cores and only a specific amount of memory. The second is a very specialized processor where you may run a specific application. How does co-design deal with these different approaches?
Matalon: One issue is really validating the spec. The ideal solution is to capture your specification as early as possible as an executable specification, where you don’t just generate UML diagrams but simulate it dynamically under the conditions in which the specification has to work. This is a level above implementation. That’s very important. If you go one level down and start doing partitioning in LT mode, you can’t really evaluate the tradeoffs. It’s good for refining the specification, but you can’t validate that the performance, bandwidth and power are there. In my view, the key is to first focus on architectural exploration to make sure you have the right performance and power. For that you need an approximately timed model that allows you to evaluate performance and power for your standard processor, for your specialized processor, for multiple cores, for all the combinations and topologies. It cannot be too low-level, so that you can use it to do power/performance evaluations and a power budget, and, if you need to, shut it off and run other parts very fast. That’s where I see the ideal solution. Some customers are using it and many customers are not. You can’t wait for the RTL. It’s too late. You can do co-design and co-validation through all the stages of implementation, but from a design perspective, for these types of designs you have to start above RTL.
Shuler: When we’re talking co-design, it’s a people problem, not a technology problem. You don’t see kick-offs where there are hardware and software people in the same room, or even where the architects and verification people are in the same room. The semiconductor vendor is responsible now. When you think about it, the real customers of the semiconductor companies are the software vendors.
Greenbaum: Absolutely. And very few semiconductor companies recognize this.
Matalon: It’s not as bad as that, but it’s not yet the prevalent methodology. Co-validation is already quite entrenched, because emulation, acceleration and virtual prototypes are really co-validation. The co-design—evaluating the performance and power—is still at the early stages.
Greenbaum: The big difference is between software drivers written for verification or validation and those written for real applications. The semiconductor vendors that are doing the worst job of delivering a full platform of silicon and software don’t understand the difference. They’re delivering verification code, but when you try to use it in a software environment it rolls over and dies very quickly. There is a spectrum of companies that get it, and if you look at the acquisitions in the embedded software arena (Cavium acquiring MontaVista, Intel acquiring Wind River, the in-house Linux teams that are pervasive, and Mentor with the Nucleus product), we’re seeing the recognition there. But only the top vendors are there today.
Schirrmeister: There is always the Yin and Yang in here. We have polar opposite trends. There is the generalization of the processor, which is meant to not shoot yourself in the foot. In the embedded space you have Java-based applications. Those development environments are built in a way that is very abstract. On the other hand there are highly specialized processors enabling highly specialized applications—highly specialized hardware with an abstraction layer and then the application development environment built on top of it. Now, going back to the models, if you had the model generator where you just talk to it and it creates the AT model, that would be the perfect environment. As a practical matter, what chipmakers are looking for is the ability to mix and match. The AT model is great to represent some of those effects, such as area, power and performance. But in the next version, the question might be slightly different. That makes it very hard to build those AT models.
Shuler: We have to do all three in addition to RTL. We have cycle-accurate, loosely timed and approximately timed. You never know what people will need.
Schirrmeister: As a practical matter, there are a lot of people using AT models. But in parallel, people are taking the appropriate model for each part of the system and hooking them together. So you have emulation or rapid prototyping for the pieces that are already stable, which is where IP re-use comes in. You may not have to rebuild them as an AT model. Having a processor model of the next big.LITTLE chip to execute the software, and being able to analyze the performance of the subsystem that does more computing, allows you to create the right mix. Would the perfect environment be to have AT models for everything? Absolutely. Can you practically build AT models for everything? It may not be possible all the time.
Matalon: AT models are ideal because the other models are too slow or don’t have sufficient information. I disagree that it is difficult to build them. We automate that from simple definitions. The challenge we see sometimes is that the functional model doesn’t exist. You have a design that is very complex, and now you want to build a functional description that is equivalent to RTL. Wrapping it with timing and power can be fully automated. Even the entire platform can be fully automated. But when you have a complex design and you want a functional abstract model, you have to write it yourself. If you put the RTL on an emulator and connect it to the rest of the models, you are missing some of the capabilities for evaluating power or performance, and you’re using RTL again, so what’s the point?
Konda: On the models front, it’s true you will not be in a position to provide models for the entire SoC. And to expect a functional model from an EDA vendor is not realistic. In an SoC environment, we have a number of interfaces and devices that get attached to the SoCs that are standard specs. At Nvidia we design our own CPUs. We also use ARM CPUs. We have a GPU. If you look at where these teams are, the CPU guys are in Santa Clara, the video guys are in Shanghai, and the simple interfaces like an SD card are in India. To realize the SoC model there is no common platform that pulls all these things together. The CPU and GPU guys are forced to develop a high-level C model. They start doing their work earlier in the cycle because they are forced to. Each team is doing whatever they have to do. But bringing all of these things together to create an SoC is the biggest missing piece. If you run a Facebook or a Twitter application, power and performance are key. So how do we estimate the power consumption? We cannot do this in one platform. With emulation it is too slow. With an FPGA, by the time the FPGA starts working the chip has already come back from the fab. It is a mix of parts of the design on an emulator or an FPGA, which is a real model, and then you hook up the rest of the design that gives a good approximation of the real system. It may not be highly accurate, but if you can estimate power and performance plus or minus 10% that’s still great.

The Interconnect Game

Thursday, April 26th, 2012

By Ed Sperling
Having a single bus protocol is something most SoC engineers can only dream about. Reality is often a jumble of protocols determined by the IP they use, which can slow down a design’s progress.

The problem stems largely from re-use and legacy IP. While it might be convenient to use only ARM’s AXI standard protocol, most chips are a combination of IP tied to specific protocols that require complex interconnects, add significant time to the verification process, and often have an impact on performance.

“It’s never AMBA, Sonics or Arteris for everything,” said Mike Gianfagna, vice president of marketing for Atrenta. “There are a lot of configurations on a chip. You’ve got crossbar switching and arbitration schemes. The big question, particularly when you get into 3D stacking, is which one you should use. So you come up with half a dozen configurations and you experiment for power, performance and area.”

He said the on-chip interconnect problem is one more complexity issue that has to be ironed out. But it also has some unusual pitfalls. “An IP block is like an amoeba. It can morph in unpredictable ways. You need to be able to analyze that up front.”

How we ended up here
There have been a number of attempts over the past 15 years to avoid this kind of problem. In 1996, when the Virtual Socket Interface Alliance (VSIA) was formed, SoCs were still in their infancy even though more and more chips included some sort of processor. The hot topic at that time was whether to decouple the processor from the chip and isolate components from the interconnect. That gave rise to a handful of ARM standard buses.

“The job of the interconnect fabric is to just make it work,” said Drew Wingard, CTO of Sonics. “But what’s happening in designs is the total level of integration is going through the roof. We’re now seeing chips with more than 100 IP cores, MPEG encoders and decoders and Huffman algorithms, and you need the interconnect in a subsystem to be a good match for what you’re trying to do. The interconnect needs to be optimized for that.”

But within a single design there may be dozens of interconnects from multiple vendors, including some that were internally developed by the chipmaker.

“There will still be custom semiconductor companies doing their own interconnects,” Wingard said. “But for the bulk of the design, the number of interface standards generally is going down and most IP cores are much more latency tolerant than they used to be.”

Past, present and future
To a large extent, SoC developers are suffering from the same kind of backward-compatibility issues as software and processor vendors have been wrestling with for decades. What makes it an issue now is the level of integration and the emphasis on re-use of IP because of cost and time-to-market constraints.

“If you look at the big companies, there is a long legacy of using things so they have a lot more heterogeneous stuff,” said Laurent Moll, CTO at Arteris. “Some of it they got through acquisition. If you were to create a brand new company—and there aren’t many of those these days—with a clean sheet of paper they would most likely pick the IP that is homogeneous. So you might settle on AXI as the dominant protocol, and you might even be able to achieve that today because most commercial IP is available with AXI.”

He said the first reason companies choose a homogeneous interconnect fabric is integration and verification. “It’s easier to have one person be the expert on a team than have to work with a bunch of other experts. It also takes less time to verify, fewer tools, and less time to integrate.”

Also key is performance, but that’s far less of a clear-cut decision because not all IP behaves the same way in different designs. “There are sets of protocols that don’t like to talk with each other,” Moll said. “Even the same protocols sometimes don’t work as well together as you would expect.”

Even more complexity
Just getting these various IP blocks to talk with each other is hard enough. Doing it efficiently is as much art as science. But at the center of any discussion of power there is almost always the interconnect fabric.

“Logically, the longest wires on a chip are in the interconnect,” said Sonics’ Wingard. “You have to get to all four edges of the chip. That’s why interconnect architectures are frequently restructured to decrease the time it takes to get a signal from one side to the other.”

Wide I/O and stacked die are being viewed as a way of dramatically reducing on-chip distances by routing signals through an interposer. To a large extent, that's an interconnect problem. With non-uniform memory characteristics, one chip may be one or two clock ticks closer, which in turn improves throughput and scalability. It also allows designers to load balance data structures and traffic, Wingard said.

The downside of this approach, again, is choice—too many choices, in fact.
“The Achilles heel of 3D is too many options,” said Atrenta’s Gianfagna. “You have to reduce the number of choices quickly. So even when you come up with your bus architectures, power domain management is still a big deal.”
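Wingard's point about load balancing data structures across memories with non-uniform latency can be made concrete with a toy placement routine. The sketch below assumes two memory nodes at different latencies (in clock ticks) from a requesting core, each with a bandwidth budget, and greedily places the most bandwidth-hungry data structures first onto the closest node that still has headroom. The node names, latencies, budgets and data structures are all hypothetical, and a real flow would also weigh capacity, contention and coherency.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MemoryNode:
    name: str
    latency_ticks: int       # hypothetical access latency seen by the requesting core
    budget_gbps: float       # hypothetical bandwidth this node can absorb
    used_gbps: float = 0.0   # bandwidth already assigned to it

def place(structures: List[Tuple[str, float]], nodes: List[MemoryNode]) -> Dict[str, str]:
    """Greedy placement: hottest structure first, onto the nearest node with headroom."""
    placement: Dict[str, str] = {}
    by_latency = sorted(nodes, key=lambda n: n.latency_ticks)
    for name, demand in sorted(structures, key=lambda s: s[1], reverse=True):
        for node in by_latency:
            if node.used_gbps + demand <= node.budget_gbps:
                node.used_gbps += demand
                placement[name] = node.name
                break
        else:
            # for/else: runs only if no node had room for this structure
            raise ValueError(f"no memory node has bandwidth left for {name}")
    return placement

nodes = [MemoryNode("near", 2, 10.0), MemoryNode("far", 4, 20.0)]
placement = place([("frame_buffer", 8.0), ("cmd_queue", 1.0), ("texture_cache", 6.0)], nodes)
print(placement)
```

With these numbers the frame buffer and command queue land on the near node, and the texture cache spills to the far one once the near node's budget would be exceeded. Greedy placement is simple but not optimal in general; it is shown only to illustrate the trade-off between latency and load.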

Experts At The Table: Hardware-Software Co-Design

Thursday, April 26th, 2012

By Ed Sperling
System-Level Design sat down to discuss hardware-software co-design with Frank Schirrmeister, group marketing director for Cadence’s System and Software Realization Group; Shabtay Matalon, ESL market development manager at Mentor Graphics; Kurt Shuler, vice president of marketing at Arteris; Narendra Konda, director of hardware engineering at Nvidia; and Jack Greenbaum, director of engineering for advanced products at Green Hills Software. What follows are excerpts of that conversation.

SLD: We’ve been hearing about co-design for a long time. What are the problems that haven’t been resolved?
Matalon: The industry started moving to co-design 15 to 20 years ago with technologies such as emulation, but the big change is that emulation alone is too late. The RTL needs to be quite solid at that stage. Co-design today means working above RTL. Even co-design is not enough, because by then you’ve already defined your architecture. There is one level ahead of it where you need to really validate that your assumptions are being met. There is a major role for ESL in co-design, and we’re seeing that part of the industry take off. This is the fastest-growing segment because you need to start co-design before the RTL is implemented and validate the assumptions regarding functionality, performance, power and area before key implementation decisions are made.
Shuler: There is still a tendency to come up with some requirements and hack away at RTL. A lot of companies are getting smarter about that now. They’re starting at higher levels of abstraction and working their way toward more detail. But it still doesn’t happen all the time. When a company is purchasing IP from different vendors, there is still a question about where they get their models. That’s a common issue that slows adoption of this new approach. There also are disparate cockpits that people use. There are commercial tools for SystemC and TLM simulation that are really good and really help with ease of use. But some of the biggest companies don’t use commercial tools. What slows us down as an IP provider is that we have to make sure our interfaces and IP-XACT information work with internally developed SystemC cockpits. Those things are not specified. There are one or two or three people within those companies who know how it works.
Konda: From a need point of view, design sizes have been exploding and chips are doubling in size. On top of that, especially at the SoC level, we see a number of processors and 20 to 25 different interfaces, such as USB, Ethernet controllers and SATA, so co-design has become an absolute necessity. The second piece of the puzzle is software. If we follow the traditional model of finishing the design and only then starting software development, it’s way too late. We are building our own design environment: RTL models, C models, and simulating all of this at the SoC level. What we are doing is homegrown, but we also are looking at commercial tools to see if they would make our life easier. What we would like is to mix and match various models. Some parts of the design might be on an FPGA, some on an emulator, some might be C models and others could be RTL models. It’s a complex problem and I don’t see a clean solution at the moment. The end goal is to realize an SoC as early as possible.
Greenbaum: There are two sides to this. One is that we’re frequently asked to fix it in software when the chip is already in production. We’ve seen mistakes that make it all the way through several revs of silicon. We also look at this as a tool provider, where there is a continual cultural divide between hardware engineers and software engineers. For silicon that’s already in production, the biggest problem is memory bandwidth. We have a board in our lab right now where doing something as pedestrian as video capture while the CPU and GPU are busy causes time-outs on the PCI bus. You can’t stream video to memory while the processor and GPU are busy. This was a function of the system architecture that the SoC was supposed to handle. The arbiter doesn’t arbitrate properly. There are no knobs or gauges on it. And the problem wasn’t discovered until very late. There’s nothing that can be done in software. This is the most common error I see. Memory bandwidth problems aren’t something organizations can tackle today.
Schirrmeister: There are problems on the tool side, as well, right?
Greenbaum: You would hope that to do hardware-software co-design you could get system architects, software architects and processor architects in the same room and solve the problems up front. Unfortunately they don’t speak the same language. The implementation folks, especially here in the United States, speak Verilog. The software engineers speak C. The system architects know how to draw boxes and lines. When I first learned about SystemC I thought everyone could speak the same language. It’s not happening quickly enough. Language is just the easiest place to see this cultural divide, and it will always prevent us from shortening our cycles.
Schirrmeister: The whole notion of becoming independent of hardware and software increasingly will be adopted. The engineers who know boxes and lines may expect that UML (Unified Modeling Language) and SysML (Systems Modeling Language) can be used, once they know functionally what the system will do, to attach requirements or whatever they need on top of it. SystemC models are great, but how do you get above that? There will need to be more automation. When I did my first design I drew gates and connected them by hand. Then we began connecting bigger blocks and the assembly was automated. Automated scripts will become more common in the future. And as we are spreading out toward hardware-software independence, the underlying automation that will bring them together will have to be created because it’s too complex for human beings.
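Greenbaum's lab board is a case where the arbitration policy, not just raw bandwidth, decides whether a master is ever served. The toy cycle model below compares strict fixed priority with round-robin. It assumes a single memory port that grants one request per cycle and a made-up traffic pattern: a CPU that issues a request every other cycle, and a GPU and video-capture engine that always have a request pending. Requests stay pending until granted. The master names, traffic rates and policies are illustrative assumptions, not a model of the board he describes.

```python
from collections import Counter
from typing import Callable, Dict, Optional

# Hypothetical masters sharing one memory port; list order is the fixed priority (highest first).
MASTERS = ["cpu", "gpu", "video_capture"]

def fixed_priority(pending: Dict[str, bool], last: Optional[str]) -> Optional[str]:
    """Grant the highest-priority master with a pending request."""
    return next((m for m in MASTERS if pending[m]), None)

def round_robin(pending: Dict[str, bool], last: Optional[str]) -> Optional[str]:
    """Grant the first pending master after the previous winner, wrapping around."""
    start = MASTERS.index(last) + 1 if last is not None else 0
    for i in range(len(MASTERS)):
        m = MASTERS[(start + i) % len(MASTERS)]
        if pending[m]:
            return m
    return None

def simulate(arbiter: Callable[[Dict[str, bool], Optional[str]], Optional[str]],
             cycles: int) -> Counter:
    """One grant per cycle; a request stays pending until it is granted."""
    pending = {m: False for m in MASTERS}
    grants: Counter = Counter()
    last: Optional[str] = None
    for t in range(cycles):
        if t % 2 == 0:
            pending["cpu"] = True          # assumed: CPU issues a request every other cycle
        pending["gpu"] = True              # assumed: GPU always has work queued
        pending["video_capture"] = True    # assumed: capture always has data to write
        winner = arbiter(pending, last)
        if winner is not None:
            pending[winner] = False
            grants[winner] += 1
            last = winner
    return grants

print("fixed priority:", dict(simulate(fixed_priority, 300)))
print("round robin:   ", dict(simulate(round_robin, 300)))
```

Over 300 cycles, fixed priority splits the port between the CPU and GPU and never grants video capture at all, which is how a capture buffer ends up timing out; round-robin gives each master a third. Production interconnects often layer weighted or bandwidth-regulated QoS on top of either policy, which is the kind of knob Greenbaum says was missing.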

SLD: What’s the starting point? Is it hardware, software, or both?
Matalon: The world has changed. It used to be primarily hardware-centric. Now there are two challenges. One is hardware-software co-design. The other is that there is no hardware at all. The majority of hardware designs are based on standard processors or embedded processors that can do everything. Hardware is not required unless the product is really addressing a niche in the market where power and performance are important. Not every design cares about those factors. If I want to control a refrigerator, all the computation and control can be done in software. Co-design is not for everything. A lot of designs are getting out the door where the challenge is writing the software. When you get into co-design, it implies you have a problem that cannot be solved in software alone. You may need higher performance. You may need a network on chip to implement a design. You need a specific architecture or arbitration scheme, or an accelerator to implement graphics, and you need RTL for that because a standard processor won’t do the job. Then you head toward multiple processors and multiple cores combined with hardware. That’s the sweet spot for co-design. There’s no doubt that models have been a drag on this. When the design needs to meet performance and power targets, and there is a software part where certain tasks will be done with processors and others with network devices, you need to solve the problem of partitioning between hardware and software. You have to do it right so that you account for both hardware and software.
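The hardware-software partitioning problem Matalon describes can be illustrated as a small search. The sketch below assumes each task has a software run time on a standard processor, a faster run time on a dedicated accelerator, and an area cost for that accelerator. It exhaustively searches for the cheapest set of tasks to move into hardware so that the tasks, run back to back, meet a deadline. All task names and numbers are hypothetical, the model ignores communication overhead and power, and exhaustive search only makes sense for a handful of tasks.

```python
from itertools import combinations
from typing import List, NamedTuple, Optional, Tuple

class Task(NamedTuple):
    name: str
    sw_ms: float    # hypothetical run time on a standard embedded processor
    hw_ms: float    # hypothetical run time on a dedicated accelerator
    hw_area: float  # hypothetical accelerator area, arbitrary units

def partition(tasks: List[Task], deadline_ms: float) -> Optional[Tuple[float, List[str]]]:
    """Return (area, tasks moved to hardware) for the cheapest split whose
    back-to-back run time meets the deadline, or None if no split does."""
    best: Optional[Tuple[float, List[str]]] = None
    for k in range(len(tasks) + 1):
        for hw in combinations(tasks, k):
            hw_names = {t.name for t in hw}
            runtime = sum(t.hw_ms if t.name in hw_names else t.sw_ms for t in tasks)
            area = sum(t.hw_area for t in hw)
            if runtime <= deadline_ms and (best is None or area < best[0]):
                best = (area, sorted(hw_names))
    return best

tasks = [
    Task("control",  2.0,  1.0, 5.0),
    Task("graphics", 20.0, 4.0, 8.0),
    Task("codec",    12.0, 3.0, 6.0),
]
print(partition(tasks, deadline_ms=20.0))
print(partition(tasks, deadline_ms=40.0))
```

With a 20 ms deadline, only the graphics task needs an accelerator. Loosen the deadline to 40 ms and the search returns an empty hardware set, the refrigerator-controller case where everything stays in software.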

SLD: You’re talking about complete optimization?
Matalon: Yes. And here’s where we need to focus.
Schirrmeister: The challenge lies in not knowing what you don’t know. When I’m writing the software, I may not know that the memory that talks back to the fridge doesn’t know the grocery store where I’m going to buy my milk has run out. I may not have foreseen that scenario. To me, things always start at the functional requirements of the user, whether it’s graphics design or memory bandwidth. People don’t get the requirements right in the first place, even before marketing changes them. And how to transfer those requirements into co-design is a challenge we haven’t solved. That’s where the models come in. If you just do this at the loosely timed (LT) level, you may not see the memory problem. If you go down to the cycle-accurate level, you may not be able to run enough cases to figure out the configurations.
Matalon: From a practical perspective, you may not need co-design for all designs. In an SoC, where you need to access memory and peripherals and sensors, the ratio is probably 100:1. But the pain level and the complexity aren’t such that everyone needs it.
Shuler: It’s more up-front work.

Power Benefits Of Modular Interconnect Design Using Network-On-Chip Technology

Wednesday, April 25th, 2012

The system-on-chip (SoC) interconnect spans the entire floorplan of a chip and consumes a significant portion of the power. The interconnects of today’s SoCs are a distributed architecture of switches, buffers, firewalls, register slices, and clock and power domain crossings. One approach is to implement these units modularly with a simple, universal transport protocol between all units. This approach enables unit-level clock gating, eliminating clock tree switching power when no traffic is present. Modularity also localizes logic, which minimizes long wires and further limits power consumption by keeping capacitance low. The simplicity of the protocol also allows each function to be performed with minimal logic overhead, minimizing area and leakage power consumption. This design approach is worth consideration for power-sensitive SoCs.
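The unit-level clock-gating argument can be sketched with a simple energy tally. The model below assumes each interconnect unit burns a fixed amount of clock-tree energy per cycle whenever its clock runs, and that gating stops a unit's clock in any cycle where it carries no traffic. The unit names and energy figures are hypothetical placeholders, and the model ignores gating-cell overhead, wake-up latency and leakage.

```python
from typing import List, Set

# Hypothetical per-cycle clock-tree energy for each interconnect unit, in picojoules.
CLOCK_ENERGY_PJ = {"switch_a": 2.0, "switch_b": 2.0, "firewall": 1.0, "reg_slice": 0.5}

def clock_energy(trace: List[Set[str]], gated: bool) -> float:
    """Sum clock-tree energy over a trace; each entry lists the units carrying traffic that cycle.
    Without gating every unit is clocked every cycle; with gating only active units are."""
    total = 0.0
    for active in trace:
        for unit, energy in CLOCK_ENERGY_PJ.items():
            if not gated or unit in active:
                total += energy
    return total

# A four-cycle hypothetical traffic trace with two idle cycles.
trace = [
    {"switch_a", "reg_slice"},
    set(),
    {"switch_a", "switch_b", "firewall", "reg_slice"},
    set(),
]
print("ungated:", clock_energy(trace, gated=False), "pJ")
print("gated:  ", clock_energy(trace, gated=True), "pJ")
```

Because half the cycles in this trace are idle and the others exercise only some units, gating removes most of the clock energy; the savings shrink as the interconnect becomes busier.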


Bridging Hardware And Software

Wednesday, April 25th, 2012

System-Level Design talks about where the problems are with hardware-software co-design and how much progress we’ve made with Narendra Konda of Nvidia, Frank Schirrmeister of Cadence, Shabtay Matalon of Mentor Graphics, Kurt Shuler of Arteris and Jack Greenbaum of Green Hills Software.
