Posts Tagged ‘timing’

Rethinking Timing Optimization

Thursday, April 26th, 2012

By Ann Steffora Mutschler
As semiconductor manufacturing technology continues its march toward 20nm, SoCs are plagued with advanced interconnect delays, cross capacitance, and process variability, as well as area and power constraints—and the significance of these factors is increasing with each passing node.

“With lower nodes we are getting advantage on area, more and more logic is getting included onto the dies, which means many interfaces are also integrated on the same die. All the interfaces may have their own unique requirements of clocking and when they interface with each other there are many constraints that come into picture which result into complex clocking schemes for the ASICs,” explained Shrikrishna Mehetre, engineering manager at Open-Silicon.

In advanced SoCs additional modes of operation dictate that extra tradeoffs in timing closure must be made to make sure the device works in all of the different modes and with the minimal amount of power, said Andy Inness, place and route specialist at Mentor Graphics. “Fixing power in one mode may break an operation in another mode. You need to make sure you take both into account, so when you optimize one you don’t break a different one.” The physical effects of advanced nodes can have varied impacts because as the voltage is varied, the physical effects are different in different voltages.

Advanced timing closure requires the P&R tool to account for all operating modes, and all process corners. This significantly reduces the number of timing violations and helps for faster timing convergence. Source: Mentor Graphics

In recent designs, Open-Silicon has seen that additional complexity as it has integrated numerous such interfaces and ended up with as many as 400 clock domains, many of which are generated clocks and dependent on each other. While so many interdependent clocks interact with each other on a 18mm x 18mm die, the insertions delays and skews for all these clocks and their impact on the overall timing optimizations is huge.

In addition to the clocking requirement for the design functionality, complexity also comes from the DFT logic addition and the test requirement for such huge logic and memories in the designs because the clocking requirements for DFT logic add to the complexity of the design. In this scenario, the timing optimization is not just restricted to achieving better data path delays. It also requires consideration of physical effects such as interconnect delays, cross cap, the clocking skews and variability.

High frequencies = extreme variability
Companies like Open-Silicon are working with clocks ranging from hundreds of MHz to a few GHz at 40nm and 28nm technology nodes. At such high frequencies achieving the required skews and insertion delays for such a big clock network over large dies is very challenging. The variability across different functional/DFT modes and different process corners is very high. When combined with clock distribution over a huge die and the variability at smaller nodes, timing optimization needs to consider these large overheads, Mehetre said. Power optimizations using clock gating has become the norm, but achieving timing closure for clock-gating latches is an additional complexity the needs to be addressed.

In addition, timing optimization is not a standalone task anymore. It needs to consider all the parameters such as clock period available, physical effects (interconnect delays, cross talk, variability), and clock skews across different corners and modes. Adding in excessive margins at different stages (such as data-path optimization before CTS, clock skew opt after CTS, variability optimization after CTS is completed, SI optimization after routing is done) is becoming inefficient at smaller nodes. EDA tools needs to model these effects right from the initial optimizations rather than relying on margins at different stages. For instance, instead of applying a single OCV number for the entire design, Mehetre said it’s better to use AOCV (local OCV), and all the IP vendors need to support AOCV models at smaller nodes. So instead of trying to achieve a global skew number, effort should be put in to utilize the useful skew and planning for the useful skew should be done from initial optimizations.

The impact of shrinking process nodes
While there was an era when interconnect delays were marginal with respect to cell delays, the situation seems to be reversing beginning at 28nm. Enhanced physical synthesis considering the clocking and modeling of the interconnect/cross-talk delays, variability is needed to replace added margins for these designs. Also, logical optimizations that consider variability and different corners needs to be part of the initial optimizations, Mehetre asserted.

As such, there needs to be change in the way clocks are used for such huge designs. Rather than looking at how to optimize timing and clocks for the effects, the approach should be how to minimize the impact of these effects during the architecture and RTL design phase itself. This is especially true if the clocks are to be shared or if they are interdependent, and during the logic design the goal should be to reduce interdependency.

Paul Cunningham, senior group director of R&D at Cadence agreed that smaller geometries are having an impact and that, by far, the biggest impact is on the wires rather than scaling of the transistors. “The transistors are getting smaller faster than the wires. We went from being all about the transistors to more about the overall wire capacitances and then at the very advanced nodes—28nm and below—we’re seeing it’s more and more about the resistances in the wires.”

All of these issues require engineering teams to rethink the algorithms, which has led to a clock concurrent optimization process, he said. “The underlying model in timing terms is, rather than thinking about a path in the design—I call them chains—where a chain would be a sequence of functions that span multiple register boundaries. Eventually these functions form some kind of a feedback loop and it’s that loop that really dictates the speed. In a traditional flow, if I’ve got a four-pipeline stage loop, my speed is driven by whichever of those four pipeline stages is the longest. In a clock concurrent world, I’m going to take the average of those four pipeline stages and that’s what drives the speed of my chip.”

Embracing that approach has required major changes to the tools themselves. “It’s a complete genetic rewrite in terms of the algorithm,” Cunningham asserted. “To the end user, it just looks like a new kind of next-generation step in the flow. So I’ve just rewritten step ‘n.’ What comes into step n, what goes out of step n is still the same but the actual result is now much better for power, performance and area. You are really optimizing the clock parts and the logic parts at the same time.”

Overall, boundaries are blurring throughout the entire optimization flow and the trend in the industry is towards the different optimization steps becoming more and more intimately linked so it’s a very, very smooth transition and possibly its a more fine-grained transition as you go from step n to step n+1 to step n+2. Twenty years ago they were really very distinct, isolated steps almost as if you were running different tools, he observed.

“It’s really the sum of all the steps—making the combination greater than the sum of its parts—that’s where the industry is focused these days,” Cunningham concluded.

Experts At The Table: The Past, Present And Future Of Synthesis

Thursday, December 17th, 2009

Experts At The Table: The Past, Present And Future Of Synthesis
High-level synthesis raises the abstraction level, but it doesn’t eliminate the need for synthesizing at the RTL level; still no all-in-one solution.

By Ed Sperling
System-Level Design sat down to discuss the future of synthesis with Shawn McCloud, product line director for Catapult C Synthesis as Mentor Graphics; Chris Eddington, director of marketing for system-level products at Synopsys; Bret Cline, vice president of marketing and sales at Forte Design Systems; Sanjiv Kaul, executive chairman at Oasys; and Andy Biddle, director of business development at Magma. What follows are excerpts of that conversation.

SLD: Is synthesis changing, and if so, why?
Cline: It’s definitely changing. The reality is there is a merging of various capabilities throughout the synthesis design flow to allow better transformations of data to happen in shorter order. It’s easy to say it’s because of productivity, market pressures and cost, but that drives EDA all the time. The key point is that with tools like the new logic synthesis tools, the new place and route and the new high-level synthesis you’re able to combine functions that typically were in three or four different tools into one or two, which allows you to make better tradeoffs and ultimately get better results.
Biddle: The focus on RTL to GDS is becoming more important because of the challenges of 65nm, 40nm and now 28nm designs. Traditional RTL synthesis has to become much more physically aware. Engineers can no longer throw netlists over the wall and expect to meet timing on schedule. We’re seeing the merging of technologies, as well. There’s a need to tighten up the links between the front end and the back end. We’re similarly seeing the need for greater productivity, particularly for algorithmic-type work.
McCloud: One of the most important things is that when these new synthesis tools come out they need to build on the existing RTL flows. One of the mistakes we see is when a tool tries to replace RTL synthesis. That’s reinventing the wheel, and there are already years of R&D behind RTL synthesis. It’s more important to have high-level synthesis built on top of existing RTL methodologies. Even today, gate-level design hasn’t gone away. That helps reduce risk. RTL is a necessary stopping point. It gives the security it’s working.

SLD: What will synthesis look like in the future?
McCloud: I think we’ll see more and more technology built inside and leveraged.
Cline: I agree that leveraging the technology people already have put into a design flow is important. That’s a strategy we’ve employed. When you go to a customer that’s had an ASIC signoff flow with Synopsys or Magma or anyone else and they are going through to gates you don’t disrupt that. But we also believe you can make better decisions inside the high-level synthesis tool if you know more about what logic synthesis is going to do with it. We don’t specifically go down to GDS II as part of our flow. But there are some tricks you can do if you have an embedded logic synthesis engine inside of your high-level synthesis tool to bring some of the higher-end pieces of the RTL synthesis up.
Kaul: You need a robust RTL platform to be able to take the RTL and go down. For large parts of the design, RTL synthesis is pretty stable, but for the very large designs, it’s becoming deficient. You have designs that are 50 million or 100 million gates. Most logic synthesis tools go from RTL to a mapped netlist, and then they optimize the gates and map it to a technology platform. Then you do physical synthesis after that. There can be 50 or 60 blocks that do not quite match your physical partition. So now you’re managing constraints, you’re dealing with very long runtimes, and you’re talking about closing the loop with physical implementation. When you move up to high-level synthesis, one of the things that would be useful is a capability that allows you to get quick feedback on whether your architecture is going to work or not. Is the QR going to be good enough? Is the implementation going to be good enough? Tools have improved a lot, but you need to know whether the chip architecture is going to be good enough. As the chips get larger, it’s harder to get a good answer.
Eddington: High-level synthesis has to utilize the key technologies in the RTL flow. It has to know what the RTL logic synthesis is going to do in order to make its decision. RTL logic synthesis is not going to go away. It’s part of the design flow and well established, and you’re crazy to try and change those. The high-level synthesis tool has to plug right into that. But it also has to have the right information to make its architectural decisions. That’s all about leveraging the downstream flows and making sure that information is brought back up.

SLD: How do you define synthesis vs. high-level synthesis?
Eddington: Synthesis is logic-level decisions and optimization. High-level synthesis is architectural-level optimization that takes a behavioral description and brings it down to a logic-level description.
Kaul: From RTL down is logic synthesis. Above that is high-level synthesis.
McCloud: To me the difference is in the source. A source to high-level synthesis is, by definition, untimed. It doesn’t contain any registers or clocks or I/O timing. In RTL it’s fully timed. It has a register with logic in between. That definition is part of the source. Another difference is the set of building blocks. High-level synthesis tools are working with adders, multipliers and muxes. RTL synthesis is dealing with gates. The process of inserting time is the process of scheduling.
Eddington: I disagree with one point. Some descriptions are approximately timed and have a notion of latency. They are very valuable ways of describing something at a high level that can be synthesized.
Cline: There are certainly people in the market claiming to have high-level synthesis and it’s debatable whether that’s really high-level synthesis.
Eddington: But even in the C environment where SystemC is defined as approximately timed and untimed. I consider the untimed is high-level synthesis, for sure. But approximately timed is also high-level synthesis because it isn’t describing all of the clocks and logic-level architectural details.
Cline: High-level synthesis is a bad term. Behavioral synthesis is better but it got a bad rap a couple years ago, so people are trying to stay away from that. And so if we’re high-level synthesis, what’ the next thing that comes out? Is it higher-level synthesis?

SLD: So how much of an SoC design is done with high-level synthesis these days?
Cline: It depends on the customer. The possibility is 100%, other than processor, bus and analog. One multifunction printer company does everything that isn’t analog, bus or processor in SystemC with our tool. A lot of it is re-use from the past. You can describe all that stuff in SystemC. Some of it may look pretty low level and it can be done with RTL synthesis. It’s not what we recommend, but let’s face it. Designs have all different levels of abstraction. You need a way to handle everything.

SLD: It sounds like the definition is far from clear-cut.
Cline: Some of us disagree with the definition. We output RTL. That’s our job, and our customers go off and use that RTL. In the real world, they have to go off and use some tool and they get gates out of it. What we’re finding is that when they use a tool that combines Talus or DC-Topo into place-and-route, as well, then they get some more advantages in optimizing. You have one pinch point in the middle, which is RTL, which is our common language. I don’t think the definitions matter. But there is a real sense in the market that a different task is taking place. It doesn’t matter what you call it.
Biddle: We have a handful of customers that are using SystemC. We’ve been partnering with Mentor on making sure the structures they’re putting out can be synthesized and that we can build a flow. Ultimately, one of the big bottlenecks we see is formal verification. Most people rely heavily on formal verification today. There are a lot of techniques and tools.
McCloud: There’s no single approach for verification. There’s classic functional-based simulation as well as prototyping and emulation. Calypto has a product out called SLEC that does do C-to-RTL verification.

SLD: What happens at the next level above this?
Cline: The RTL market has a well-defined flow and there are tools that plug into this flow. At the ESL level above that there’s a lot of interesting stuff—everything from virtual prototyping to formal verification. But that hasn’t jelled into a flow that everyone uses. Everyone picks their parts and puts them together.
McCloud: You are starting to see some parts of the flow solidify. We’re seeing that in the way that people verify designs. You develop a C description, you spend as much time as you can verifying that C description is correct, and what you’re generating is something that hopefully is predictable. You want to be able to re-use it, not re-do it at the RTL level.

SLD: How accurate are RTL synthesis and high-level synthesis?
Biddle: Everyone is always demanding greater predictability. There is constant pressure to correlate better. Once you have a netlist, what will area and timing really look like? That’s forcing us to add more physical modeling. But where we’re starting to have problems, particularly as we move to 40nm and 28nm, is how long it’s going to take to close timing. Once you come out of synthesis, are you done? We see designs where they’re throwing in gates just to meet timing, so chips are getting bigger than then need to be.
Kaul: For a large chip at advanced nodes, doing timing without understanding the placement is becoming less prevalent. You have to have placement information to have accurate timing. But for larger chips, understanding the floor plan is very important. Where the block goes in relation to the rest of the chip becomes very important. You have to look at global wiring, congestion and timing convergence. It’s important for the synthesis tool to look at all of that.
Eddington: There’s more innovation available in physical synthesis—integrating these placement and logic synthesis optimizations and doing them better. We’ve been adding physical synthesis in the FPGA space, but it’s still an area that needs further innovation.