Published on November 4, 2011
No modern, synchronous integrated circuit (IC) can function without a clock network. Clock-tree synthesis (CTS), which distributes clock signals to data registers, is a critical step in the IC physical design flow. But the traditional approach to CTS, which is based on minimizing clock skew and is separate from logic optimization, is breaking down for advanced-node, high-speed designs. The problem is that today’s chip designers are facing a growing “timing gap” between the ideal clocks assumed pre-CTS and the more accurate propagated clocks, which emerge after CTS. This timing gap is driven by factors like clock gating for low power, on-chip variation, and design complexity. At 40/45 nm, the clock timing gap can be as much as 50% of the clock period. This makes it difficult for designers to close timing and impossible to fully optimize designs for timing, power, and area.
Clock-concurrent optimization is a new technology that bridges this timing gap. It makes CTS timing-driven rather than skew-driven while merging it with logic optimization. This approach simultaneously optimizes both clock and logic delays using a single cost metric. Observed results on high-speed processor designs have included clock-tree power reductions of 30%, clock-tree area reductions of 30%, and chip performance improvements of up to 100 MHz for a gigahertz design.
There’s a close analogy between the need for clock-concurrent optimization today and the emergence of physical synthesis and timing-driven placement 10 years ago. At that time, there was a divergence in timing between register-transfer-level (RTL) synthesis and placement resulting from relatively inaccurate wire-load models. Without any knowledge of placement, synthesis tools could not measure timing with enough accuracy. The solution was to make placement aware of timing. Placement was then combined with many of the logic-optimization techniques that were exploited during RTL synthesis.
Today’s divergence in timing occurs at the CTS step. Clock-concurrent optimization is the answer. This new technology is a fundamental reinvention of CTS that will be needed in any IC physical design flow targeting advanced nodes (32/28 nm or below), gigahertz speeds, and/or embedded ARM® cores, such as the Cortex™-A9 or Cortex™-A15. It is best provided as an integral part of an end-to-end, unified Silicon Realization flow—not a point tool. This article shows why this technology has become necessary, how CTS works in today’s flows, what its shortcomings are, and how clock-concurrent optimization paves the way for a new generation of complex, high-speed systems-on-a-chip (SoCs).
A Brief History Of Clocking
Clocking was a great innovation that made high-speed digital-IC design possible. It elegantly quantizes time, making it possible to abstract transistors into sequential state machines. Those machines can then be described at the RTL. But complex clock networks are a relatively recent innovation. When chips were running at speeds around 1 MHz 30 years ago, the “clock” was just a wire and CTS didn’t exist.
Wires, however, didn’t scale with transistors, and a single wire’s ability to distribute the clock signal weakened over time. Twenty years ago, designers started building structures that distributed and re-amplified the clock signal, bringing CTS into existence. An uninvited guest to this party was clock skew, which arises because it isn’t possible to distribute the clock signal to every register at exactly the same time. As clock networks got bigger, the impact of skew on performance became more and more significant.
Today, clocking is extremely complex. Clock networks may include more than 100 interlinked clock signals that can be branched and merged thousands of times. Clocks may reside in different power domains that are shut down when not active. The clock network has a substantial and growing impact on a chip’s overall power, performance, and area. The time has come for a new paradigm for CTS and logic optimization.
Where CTS Fits Today—And How It Falls Short
Traditionally, clock-tree synthesis distributes source clock signals to thousands of data registers while balancing skew so that the clock nominally arrives at every register at the same time. CTS occurs after initial placement and before routing. The real function of CTS is to bridge the gap between two levels of timing: the “ideal” clocks that are used before CTS, when there are no actual clocks, and the “propagated” clocks that are used to directly model timing post-CTS. Figure 1 shows CTS in a traditional IC physical design flow.
Figure 1: Clock-tree synthesis bridges the gap between ideal and propagated clocks.
To further explain the difference between ideal and propagated clocks, consider Figure 2. It depicts setup constraints, which ensure that every flip-flop takes one step forward when the clock ticks. It also illustrates hold constraints, which ensure that no flip-flop ever makes more than one forward step.
Propagated Clocks Model Of Timing
Setup Constraint: L + Gmax < T + C
Hold Constraint: L + Gmin > C
Figure 2: Shown are setup and hold constraints in a clock-based design.
In Figure 2, L is the clock delay to A for the launching clock while C is the clock delay to B for the capturing clock. Gmin and Gmax are the minimum and maximum logic path delays (respectively) between the two flip-flops. The setup and hold constraints indicate a propagated clocks model of timing. The word “propagated” comes from the fact that the constraints start from the root of the clock. They include the time taken for the clock edge to propagate through the clock tree to each flip-flop.
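These two constraints are simple enough to check directly. Here is a minimal sketch in Python; the symbols follow the definitions above, and the delay values are purely hypothetical:

```python
def propagated_timing_ok(L, C, Gmin, Gmax, T):
    """Check the propagated-clocks setup and hold constraints.

    L    -- launch clock delay to flip-flop A
    C    -- capture clock delay to flip-flop B
    Gmin -- minimum logic path delay between the two flip-flops
    Gmax -- maximum logic path delay between the two flip-flops
    T    -- clock period
    """
    setup_ok = L + Gmax < T + C  # data must arrive before the next capture edge
    hold_ok = L + Gmin > C       # data must not arrive before the current capture edge
    return setup_ok, hold_ok

# Hypothetical numbers: a 1 ns clock with unbalanced launch/capture clock paths.
print(propagated_timing_ok(L=0.40, C=0.25, Gmin=0.10, Gmax=0.80, T=1.0))  # (True, True)
```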
An ideal clocks model of timing simplifies the propagated clocks model by assuming that the launch and capture clock paths have the same delay (that is, that L = C). In this case, the setup and hold constraints are significantly simplified:
Ideal Clocks Model Of Timing
Setup Constraint: Gmax < T
Hold Constraint: Gmin > 0
With ideal clocks, there’s no need to worry about clock delays or minimum logic delays. All that matters is making sure that the design’s maximum logic path delay—typically referred to as the “critical path”—is faster than the clock period. In essence, clocks have been canceled out of the timing-optimization process.
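The danger of this simplification is easy to illustrate. In the hypothetical sketch below, a register pair passes both ideal-clock checks yet violates the propagated hold constraint once the actual, unbalanced clock delays are taken into account:

```python
# Illustrative (hypothetical) numbers for one launch/capture register pair, in ns.
T, Gmin, Gmax = 1.0, 0.05, 0.70   # clock period and min/max logic path delays
L, C = 0.20, 0.45                 # unbalanced launch/capture clock delays

# Ideal clocks model: assumes L == C, so clock delays cancel out.
ideal_setup_ok = Gmax < T          # 0.70 < 1.0  -> passes
ideal_hold_ok = Gmin > 0           # 0.05 > 0    -> passes

# Propagated clocks model: uses the actual clock delays.
prop_setup_ok = L + Gmax < T + C   # 0.90 < 1.45 -> passes
prop_hold_ok = L + Gmin > C        # 0.25 > 0.45 -> fails

print(ideal_hold_ok, prop_hold_ok)  # prints: True False
```

The hold violation is invisible pre-CTS and only surfaces once real clock delays exist, which is precisely the timing gap this article describes.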
Because ideal clocks assume L = C for all setup and hold constraints, it’s not surprising that the traditional CTS role is to balance clocks such that L = C. However, clock skew and the worst-case difference between L and C aren’t the same thing. Clock skew is the difference between the shortest and longest clock path delays from source to sinks. In nanometer ICs, it’s possible, and very common, for the worst-case difference between L and C to be significantly greater than the clock skew.
Essentially, the ability of tight clock skews to bind ideal clock timing to propagated timing is broken. The only solution is to give up entirely on the concept of skew, make CTS timing-driven, and combine it with logic optimization. That’s the essence of clock-concurrent optimization.
The Clock Timing Gap
Just how bad is the timing gap between ideal and propagated clocks, and how did it arise? Figure 3 summarizes the average clock timing gap for the top 10% worst violated setup constraints across a portfolio of more than 60 real-world, commercial chip designs from 180 to 40/45 nm. The designs had from 200K to 1.26M placeable instances. At 180 nm, the clock timing gap is small—around 7% of the total clock period. At 40/45 nm, however, the gap widens to around 50% of the clock period.
Figure 3: The clock timing gap widens at lower process nodes.
At 40/45 nm, this timing gap is wide enough to completely transform the timing landscape of a design beyond recognition between the pre- and post-CTS phases. The only solution is to directly target the propagated clock timing constraints. The launch (L) and capture (C) clock paths should be treated as optimization variables with the same significance—and degrees of freedom—as the logic path variables Gmin and Gmax.
Why has the timing gap become so significant at advanced process nodes? Three factors are contributing to this effect:
· On-chip variation: OCV arises because the performance of supposedly identical transistors can vary by unpredictable amounts. At 45 nm, these random variations can change logic delay paths by up to 20%. Even if the impact of OCV is only 10% of the path delay, it still amounts to a potential gap of 30% to 50% of the clock period between ideal and propagated clocks. Traditional measures of skew ignore OCV.
· Clock gating: With today’s demand for low-power electronics, most modern IC designs use clock gating to shut down clocks that aren’t needed in a particular clock cycle. Every clock gate added to a design adds to that design’s clock timing gap. After all, clock gates are inside the clock network and therefore can never be balanced relative to the registers.
· Clock complexity: Large SoC designs typically include a dense spaghetti network of clock muxes, XORs, and generators entwined with clock-gating elements from the highest levels of the clock tree to the lowest. A tree or set of trees may include a network with hundreds of sources and hundreds of thousands of sinks. Achieving zero clock skew in such networks is often impossible, or possible only at an unacceptable power cost.
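The OCV arithmetic can be made concrete with a back-of-the-envelope sketch. The numbers below are hypothetical (a 1 ns clock and 1.5 ns clock-tree insertion delays), and the derating follows the common worst-case convention of pushing the launch path late and the capture path early:

```python
# Hypothetical: a 1 ns clock with deep, perfectly balanced clock trees.
T = 1.0              # clock period, ns
L_nom = C_nom = 1.5  # nominal launch/capture clock delays (zero nominal skew)
ocv = 0.10           # 10% random per-path variation

# Worst-case analysis derates the launch path late and the capture path
# early, because the two paths can vary independently of each other.
L_worst = L_nom * (1 + ocv)  # 1.65 ns
C_worst = C_nom * (1 - ocv)  # 1.35 ns

gap = L_worst - C_worst      # timing uncertainty between launch and capture
print(f"gap = {gap:.2f} ns = {gap / T:.0%} of the clock period")
```

Note that the nominal skew here is exactly zero; the 0.30 ns gap (30% of the period) comes entirely from variation, which is why traditional skew metrics miss it.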
The idea behind clock-concurrent optimization is that the only meaningful goal for clock construction is to directly target the propagated clock timing constraints. L and C should be selected specifically for the purpose of delivering the best possible post-CTS timing picture. Clock-concurrent optimization merges physical optimization with CTS and directly controls all four variables in propagated, clock-timing-constraint equations—L, C, Gmin, and Gmax—at the same time. As depicted in Figure 4, clock-concurrent optimization gives up on the idea that skew is fixed. It uses L and C as variables to achieve the best possible timing.
Figure 4: Clock-concurrent optimization allows variation in the launch (L) and capture (C) paths, allowing better timing and power optimization of IC clock networks.
Because both clock and logic delays are flexible parameters, the maximum possible speed at which a chip can be clocked is no longer limited by the design’s slowest logic path (Gmax). The capture clock path can be longer than the launch clock path or vice versa. Essentially, time can flow forward or backward to the next or previous logic stage, respectively.
Such “time borrowing” can be iterative across multiple logic stages. If time can be borrowed from logic stage n + 1 to logic stage n, time also can be borrowed from logic stage n + 2 to logic stage n + 1, and then again from logic stage n + 1 to logic stage n, and so on. It can be borrowed both forward and backward from logic stage n.
The time borrowing isn’t unlimited, however. It must stop either when the chain of logic stages loops back on itself or when it reaches an I/O to the chip. In a world where launch and capture clock paths are flexible optimization parameters, these “chains” of logic functions most influence the maximum possible clock speed. At most, a chain with n logic stages has n clock periods of total time available—irrespective of the clock delays to each register in the chain. Provided that the worst-case total logic delay through a chain of n stages is less than n times the clock period, it will be possible to come up with a set of clock arrival times for each register on the chain that closes timing.
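The chain argument above can be sketched as a small feasibility check. This is a minimal illustration, not a production clock scheduler: it considers only the worst-case (setup) delays of a single chain, ignores hold constraints, and pins both chain ends to the ideal clock grid, as if they were chip I/Os:

```python
def chain_clock_offsets(stage_delays, T):
    """Assign a clock arrival offset to each register along a chain.

    stage_delays[i] is the worst-case logic delay (Gmax) from register i
    to register i+1, and T is the clock period. Offsets are relative to
    an ideal grid of one clock period per stage; both chain ends are
    pinned to offset 0. Returns None if the chain is infeasible, i.e.
    the total logic delay exceeds n clock periods.
    """
    n = len(stage_delays)
    total = sum(stage_delays)
    if total > n * T:
        return None  # more logic delay than total time available in the chain
    # Spread the chain's total slack evenly across the n stages, so each
    # register borrows just enough time from its neighbors.
    slack_per_stage = (n * T - total) / n
    offsets = [0.0]
    for d in stage_delays:
        offsets.append(offsets[-1] + d - T + slack_per_stage)
    return offsets

# Hypothetical 1 ns clock: the 1.3 ns stage borrows time from its neighbors.
print(chain_clock_offsets([1.3, 0.6, 0.9], T=1.0))
```

The 1.3 ns stage exceeds the 1.0 ns period on its own, yet the chain still closes timing because the neighboring stages lend it slack, exactly the iterative time borrowing described above.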
It’s important not to confuse clock-concurrent optimization with “useful skew,” a technique that deliberately skews selected registers either before or after the clocks are built. If useful skew is applied pre-CTS in the flow, the timing slacks on which it is based are ideal clock slacks, and therefore inaccurate. If useful skew is applied post-CTS, the clock tree is already built and can only be tweaked in small ways.
The goal of clock-concurrent optimization—in contrast to useful skew—is to build clocks globally from the ground up based on propagated clock slacks. Of course, this is a paradox, as propagated clock slacks don’t exist until after clocks are built. The best possible compromise must therefore involve an iterative process that attempts to approximate this goal over a series of steps, where each step builds in some way on knowledge from the previous step. Of course, if the clocks are being built based on slacks, it’s vital that other things that affect slacks become part of this iterative process too—hence the term clock-concurrent optimization. Examples include traditional logic optimization and timing-driven placement.
Clock-concurrent optimization thus brings clock scheduling and physical optimization together in a single iterative step. It works with clocks and logic simultaneously to achieve the best possible result. As shown in Figure 5, clock-concurrent optimization replaces both the CTS and post-CTS physical-optimization steps in the traditional IC design flow.
Figure 5: Clock-concurrent optimization is shown in the IC physical design flow.
The key benefits of clock-concurrent optimization include:
· Increased chip speed or reduced chip area and power: At 65 nm and below, the achievable increases in clock speed can be as much as 20%.
· Reduced IR drop: The peak current drawn by the clock network is significantly reduced.
· Increased productivity and accelerated time to market: Removing the need to define and balance clock-tree constraints can take a month off the SoC design cycle. Furthermore, clock-concurrent optimization results in significantly fewer iterations between the front-end and back-end design teams.
· Accelerated migration to 45 nm and below: Clock-concurrent optimization addresses the worsening challenge of timing closure, easing the move to advanced process nodes.
In closing, traditional approaches to clock-tree synthesis are breaking down due to effects like on-chip variation, clock gating, and clock complexity. The timing gap between the ideal timing models used pre-CTS and the propagated timing models used post-CTS is growing. And that gap is resulting in sub-optimal solutions with respect to performance, power, and area. A CTS strategy based on balancing skew is becoming unworkable and counter-productive.
Clock-concurrent optimization turns the focus away from skew balancing, makes CTS timing-driven, and combines it with physical optimization. Instead of trying to eliminate all skews in arrival signals, it allows skew to vary within timing windows. In doing so, it allows some logic functions to run more slowly and others to run more quickly. Clock-concurrent optimization has demonstrated significant performance, power, and area gains in real designs.
Clock-concurrent optimization is much more than a point tool. It is a capability that must be part of any advanced-node IC design system. Following its acquisition of clock-concurrent optimization pioneer Azuro, Cadence Design Systems has brought this technology into its Encounter® Digital Implementation System. Over time, all vendors will need to move to this new paradigm. In doing so, they will reshape the IC physical design flow as surely as the move to physical synthesis and timing-driven placement did 10 years ago.