Published on April 17th, 2008
On-chip thermal analysis calculates and reports thermal gradients or variations in operating temperature across a design. This analysis is increasingly important for the advanced digital integrated circuits (ICs) created at today’s 90- and 65-nm technology nodes. In fact, it is anticipated that this form of analysis will be mandatory for designs implemented using 45- and 32-nm technologies. Increasing power density, multifunction designs, and the use of advanced low-power design techniques all lead to increased on-chip temperature gradients. These temperature gradients must be accounted for during critical chip analyses including leakage power, chip timing, signal integrity (susceptibility to crosstalk and noise), and electromigration. If they aren’t accounted for, analysis errors will escalate, forcing design teams to increase design margins to compensate for the rising inaccuracies.
A number of design challenges and technology factors are leading to increasing concern with regard to on-chip temperature gradients. In terms of the technology factors, power density (power per unit area) is increasing with each new technology node. After all, smaller geometries enable more functionality to be fit within the same area of a die. This aspect enables design teams to commit to larger and larger designs. At the same time, it significantly increases heat generation, which can result in high thermal gradients. Figure 1 illustrates how the total power dissipation of designs is increasing with every new generation. The figure also shows a comparison of the relative contributions of the different power sources.
It’s common to think that more functionality simply equates to greater power consumption. It’s less obvious that different functions have different thermal profiles. A central-processing-unit (CPU) core tends to run relatively hot while on-chip memory tends to run relatively cold. In addition, the temperature of special-purpose hardware accelerators and cores, such as digital-signal-processing (DSP) cores, varies with activity. When each of these functions was implemented on a separate chip, each chip had its own well-defined thermal profile. In this case, it was sufficient to perform all of the analysis (timing, for example) assuming a single temperature across the entire die.
When all of these functions are combined on a single multimode design, however, the result is an ever-varying mish-mash of “hot” and “cool” spots that depend on the mode of operation. A cell phone is a good example of this type of design. The act of creating a text message will exercise certain functionality, which creates a specific thermal profile. But the act of transmitting this message will exercise different functionality, which results in a different profile. The same can be said for using the cell phone to make a voice call, play an MP3 file, take a picture, and so forth. Each of these multiple thermal profiles will affect the timing, power consumption, signal integrity, and electromigration characteristics of the design in different ways and with different levels of severity. The resulting temperature variation across a chip is typically around 10° to 15°C. If it’s not managed, however, that temperature variation can be as high as 30° to 40°C.
In recent years, power management has moved to the forefront of application-specific-integrated-circuit (ASIC) and system-on-a-chip (SoC) development concerns. Here, the combination of higher clock speeds, greater functional integration, and smaller process geometries has contributed to significant growth in power. In an attempt to manage power consumption, a variety of design techniques have been developed to meet aggressive power specifications. These include (but are not limited to) the use of clock gating, multiple-switching-threshold (multi-Vt) transistors, multi-supply multi-voltage (MSMV), substrate biasing, dynamic voltage and frequency scaling (DVFS), and power shut-off (PSO).
In the case of MSMV, functional blocks that aren’t performance critical are run at a lower voltage and/or frequency to conserve power. As a result, they’ll run cooler than surrounding functions which are running at higher voltages and/or frequencies. This analysis challenge is further complicated in the case of blocks using DVFS techniques, where supply voltage and/or operating frequency may change dynamically over time. Each of the different combinations of voltage and frequency (coupled with the type and quantity of data processing being performed) will result in different thermal profiles for these blocks.
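The dependence on voltage and frequency can be made concrete with the standard first-order CMOS switching-power relation, P = a × C × V² × f, where a is the activity factor and C the switched capacitance. The sketch below is only illustrative: the operating points, the activity factor, and the capacitance are hypothetical values chosen for the example and do not come from any design discussed here.

```python
def switching_power(activity, cap_farads, vdd_volts, freq_hz):
    """First-order CMOS switching power: P = activity * C * Vdd^2 * f."""
    return activity * cap_farads * vdd_volts ** 2 * freq_hz


# Hypothetical DVFS operating points: (name, supply in volts, clock in hertz).
OPERATING_POINTS = [
    ("high", 1.2, 600e6),
    ("mid", 1.0, 400e6),
    ("low", 0.8, 200e6),
]


def power_by_mode(activity=0.15, cap_farads=2e-9):
    """Return the switching power in watts for each assumed operating point."""
    return {
        name: switching_power(activity, cap_farads, vdd, freq)
        for name, vdd, freq in OPERATING_POINTS
    }


if __name__ == "__main__":
    for name, watts in power_by_mode().items():
        print(f"{name}: {watts:.4f} W")
```

Because voltage enters as a square, a block that drops both its supply and its clock sheds heat much faster than the frequency change alone would suggest, which is why each operating point needs its own thermal profile.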
Perhaps the most extreme low-power technique (in the context of its thermal impact) is that of PSO, where blocks can be completely shut down to conserve power when they’re inactive. Obviously, the thermal characteristics for such blocks depend on their state. Given an increase in the number of low-power designs featuring multiple operating modes, it’s necessary to calculate and account for these mode-specific thermal profiles.
Power dissipation, realized as heat, comes from a combination of switching power and leakage power. Switching power is a function of logic toggle rates, buffer strengths, and parasitic loading while leakage power is a function of the process technology and device characteristics. Thermal-analysis solutions must account for both sources of power dissipation. They must understand how heat is conducted away from the heat sources in a design and how heat can build up in corners of the design where thermal barriers prevent dissipation. Figure 2 illustrates the flow with regard to generating a thermal map of a design.
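To give a feel for how a power map turns into a thermal map, the following sketch relaxes a small two-dimensional grid to steady state. It is a toy model under stated assumptions, not the flow a production thermal-analysis tool uses: the die is a uniform grid, neighboring cells exchange heat through a single hypothetical conductance (g_lateral), and the edge cells are pinned at an assumed ambient temperature as a crude stand-in for the package boundary.

```python
def thermal_map(power, ambient=45.0, g_lateral=0.05, iterations=2000):
    """Relax a 2D grid of cells to a steady-state temperature map.

    power      -- list of rows; watts dissipated in each cell
    ambient    -- assumed boundary temperature in degrees C
    g_lateral  -- assumed conductance between adjacent cells (W per degree C)

    Edge cells are held at `ambient`, so any power assigned to them is ignored.
    """
    rows, cols = len(power), len(power[0])
    temp = [[ambient] * cols for _ in range(rows)]
    for _ in range(iterations):
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                neighbors = (temp[r - 1][c] + temp[r + 1][c]
                             + temp[r][c - 1] + temp[r][c + 1])
                # Heat balance per cell: flow to the four neighbors equals cell power.
                temp[r][c] = (neighbors + power[r][c] / g_lateral) / 4.0
    return temp


def hot_spot_example():
    """One hypothetical hot block on an otherwise idle 7 x 7 die."""
    size = 7
    power = [[0.0] * size for _ in range(size)]
    power[2][2] = 0.5  # watts, illustrative value only
    return thermal_map(power)


if __name__ == "__main__":
    for row in hot_spot_example():
        print(" ".join(f"{t:6.2f}" for t in row))
```

Even this simplified model shows the qualitative behavior described above: the temperature peaks at the heat source and falls off toward the boundary, producing a gradient rather than a single die temperature.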
Higher operating temperatures increase leakage, degrade transistor performance, decrease electromigration limits, and increase interconnect resistivity. Leakage increases exponentially with temperature as illustrated in Figure 3. In addition, increased leakage leads to increased power consumption with each new process node. For its part, degraded device performance impacts timing. It also can increase susceptibility to signal-integrity noise injection—especially in the case of aggressor devices that are running at lower temperatures combined with victim devices running at higher temperatures.
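The exponential relationship means that a single-temperature assumption can badly misjudge leakage in a hot spot. The sketch below assumes a simple doubling model in which leakage doubles for every fixed temperature increment. The 20°C doubling interval and the reference temperature are hypothetical placeholders; real values depend on the process and must come from characterization data.

```python
def leakage_power(p_ref_watts, temp_c, ref_temp_c=25.0, doubling_c=20.0):
    """Leakage at temp_c, assuming it doubles every `doubling_c` degrees C.

    p_ref_watts is the leakage measured at ref_temp_c. Both the exponential
    form and the default doubling interval are illustrative assumptions.
    """
    return p_ref_watts * 2.0 ** ((temp_c - ref_temp_c) / doubling_c)


def leakage_ratio(gradient_c, doubling_c=20.0):
    """How much more a hot spot leaks than a region `gradient_c` cooler."""
    return 2.0 ** (gradient_c / doubling_c)


if __name__ == "__main__":
    # Gradients in the range quoted for managed and unmanaged designs.
    for gradient in (10.0, 15.0, 30.0, 40.0):
        print(f"{gradient:4.0f} C gradient -> {leakage_ratio(gradient):.2f}x leakage")
```

Under these assumed numbers, a 40°C on-chip gradient would correspond to a fourfold difference in leakage between the hottest and coolest regions, which a uniform-temperature analysis cannot capture.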
Higher operating temperatures also increase susceptibility to electromigration. The most critical case of electromigration will cause a design to catastrophically fail due to open or short circuits in the wires. Yet it’s more common that electromigration results in further increases in wire resistivity, which can impact both timing and signal integrity. To comprehensively address thermal issues, it’s necessary to have a unified system in which all of the major analysis components—parasitic extraction, timing, signal integrity, power, electromigration, etc.—are thermally aware.
The package serves several purposes. One key consideration is the package’s ability to effectively transfer heat away from the die. This heat transfer can be into the printed-circuit board (PCB) or through the outside of the package to the surrounding air (possibly via a heatsink). In some cases, more extreme techniques may be employed, such as a liquid cooling system.
Typically, increasing the package cooling capacity means raising package cost. If a design team doesn’t accurately manage heat dissipation and unexpectedly high on-chip temperatures become an issue, the team may be forced to use more expensive packaging. This additional expense wasn’t included in the initial profit-cost calculations. As a result, the project’s profitability may be significantly reduced or even totally eliminated.
Ball-grid-array (BGA) packages, which may include a single or multiple die, are the standard approach for high-pin-count devices. But different options exist for thermal management of the BGA. Thermal vias in the package substrate can provide a more efficient transfer of heat into the PCB. In addition, heatsinks can help transfer heat from the package to the natural or forced convection of air. In the case of very-high-performance devices, it’s becoming increasingly common to employ more sophisticated techniques, such as liquid cooling. Figure 4 shows a typical flip-chip BGA (FCBGA) package with a lid and external heatsink. A good thermal model should account for the effects of all of the components in this illustration.
Hot spots occur not only on the die, but also on the package. This is due to the uneven distribution of power dissipation both on the die and in the substrate. The old approach of averaging out the power across the die and the substrate is no longer accurate enough. It’s now necessary to model power distribution across the die as well as individual structures in the package, such as substrate metal, wirebonds, bumps, and balls.
For accurate thermal analysis of the chip, a package-on-board thermal-analysis tool can calculate heat flux on all sides of the die. In doing so, it provides real-world boundary conditions. Heat balance is achieved when power on the chip is equal to the sum of these heat fluxes. Using an iterative approach, the applied detailed power distribution will refine the temperature map of the chip. That map, in turn, will be used to refine the package-on-board thermal analysis.
One alternative to this iterative process is to use a compact thermal-resistance model as the boundary condition for chip thermal analysis. The compact thermal-resistance model is based on the detailed package model. It is boundary-condition independent. The compact package model attached to the chip and connected to the board will correctly represent the heat exchange with the external environment via the package top and sides. Heat balance is achieved by accurately modeling the on-chip power generation, heat dissipation through thermal resistances in the package compact model, and heat dissipation through the package/board interface into ambient air.
The compact model could be a simple two-resistor model with Theta_JCtop (junction to case top) and Theta_JB (junction to board). The industry-standard DELPHI compact model is more accurate, as it has more than two resistors when representing complex package structures. This is a one-step process for accurate chip thermal analysis. But work still needs to be done to apply this approach to multi-chip applications.
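A two-resistor model can be evaluated by hand, and the sketch below shows the arithmetic. It collapses the die to a single junction node, so it yields one temperature rather than an on-chip gradient. The case-to-ambient and board-to-ambient resistances (theta_ca and theta_ba) are assumptions added to close the thermal circuit; they are not part of the two-resistor model itself, and all numeric values are hypothetical.

```python
def junction_temperature(power_w, ambient_c, theta_jc_top, theta_jb,
                         theta_ca, theta_ba):
    """Solve a two-path thermal network for the junction temperature.

    Path 1: junction -> case top (theta_jc_top) -> ambient (theta_ca, assumed)
    Path 2: junction -> board (theta_jb) -> ambient (theta_ba, assumed)
    All resistances are in degrees C per watt.
    Returns (junction temperature, heat through top, heat through board).
    """
    r_top = theta_jc_top + theta_ca
    r_board = theta_jb + theta_ba
    # The two paths act in parallel between the junction and ambient.
    r_total = (r_top * r_board) / (r_top + r_board)
    t_junction = ambient_c + power_w * r_total
    q_top = (t_junction - ambient_c) / r_top
    q_board = (t_junction - ambient_c) / r_board
    return t_junction, q_top, q_board


if __name__ == "__main__":
    tj, q_top, q_board = junction_temperature(
        power_w=5.0, ambient_c=25.0,
        theta_jc_top=2.0, theta_jb=8.0, theta_ca=6.0, theta_ba=12.0)
    print(f"Tj = {tj:.1f} C, top = {q_top:.2f} W, board = {q_board:.2f} W")
```

The two heat flows always sum to the dissipated power, which is the same heat-balance condition that the package-level analysis must satisfy.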
The leakage current associated with a transistor rises as a function of increasing temperature. That higher leakage current causes more power to be dissipated, which further increases the temperature. Thermal runaway describes the condition in which the temperature keeps rising because of the feedback loop between increasing leakage and increasing temperature. Once the generated heat exceeds the package’s capacity to remove it, the chip overheats and ceases to function. If the heat isn’t efficiently dissipated through the package via the I/O connections or the substrate, thermal runaway becomes a real possibility (see Figure 5).
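The feedback loop can be illustrated with a simple fixed-point iteration: compute leakage at the current temperature, compute the temperature that the resulting total power would produce, and repeat. This is a sketch under stated assumptions, not a validated model. It lumps the package into one hypothetical junction-to-ambient resistance (theta_ja), assumes that leakage doubles every 20°C, and treats any temperature above an arbitrary 150°C limit as runaway.

```python
def settle_temperature(p_dynamic_w, p_leak_ref_w, theta_ja, ambient_c=25.0,
                       ref_temp_c=25.0, doubling_c=20.0, limit_c=150.0,
                       max_steps=500, tol=1e-6):
    """Iterate the leakage/temperature loop until it settles or runs away.

    Returns the steady-state junction temperature in degrees C, or None if
    the temperature exceeds limit_c or fails to converge within max_steps.
    """
    temp = ambient_c
    for _ in range(max_steps):
        # Assumed model: leakage doubles every `doubling_c` degrees C.
        p_leak = p_leak_ref_w * 2.0 ** ((temp - ref_temp_c) / doubling_c)
        new_temp = ambient_c + theta_ja * (p_dynamic_w + p_leak)
        if new_temp > limit_c:
            return None  # heat generation outpaces removal
        if abs(new_temp - temp) < tol:
            return new_temp
        temp = new_temp
    return None


if __name__ == "__main__":
    # Same chip, two hypothetical packages with different cooling capacity.
    for theta_ja in (10.0, 30.0):
        result = settle_temperature(2.0, 0.2, theta_ja)
        status = "runaway" if result is None else f"settles at {result:.1f} C"
        print(f"theta_ja = {theta_ja:.0f} C/W: {status}")
```

With the assumed numbers, the better-cooled package reaches a stable temperature while the poorer one never does, which mirrors the dependence on package heat removal described above.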
Designers of extremely high-performance chips, such as leading-edge microprocessors, are therefore seeking innovative solutions to help dissipate the heat from problem areas. For example, one advanced design technique that’s already under consideration is the use of thermal chimneys, which are physical structures created above and within the substrate to dissipate the heat away from the die. Designers also are starting to incorporate on-chip functionality to monitor chip temperatures and adjust the design if it starts to overheat. For example, the chip can lower voltage and/or frequency (thereby reducing power dissipation and temperature) or completely shut down the chip or system. As mainstream requirements continue to drive more and more power-hungry functionality into a single chip, these advanced thermal techniques are necessarily moving into mainstream use.
The solution to truly accounting for on-chip thermal gradient issues lies in two areas: the design tools and the design engineers. From the chip-design-tools view, critical functionality throughout a design environment must be thermal-aware. This includes RC extraction, timing, power estimation and calculation, signal-integrity analysis, power-grid analysis, electromigration analysis, and so forth. Furthermore, all of these tools should support multi-mode designs and their associated multi-mode thermal profiles.
In addition to the chip-design tools, the chip’s packaging design tools must provide accurate chip-package thermal models. They have to provide thermal boundary conditions that enable accurate on-chip thermal analysis. The combined chip-package design environment should support a full 3D thermal profile of the chip and package that accurately reflects local temperature differences on the die. Such differences are caused by device heating and power dissipation. In addition, the design environment should accurately model system-in-package (SIP) packaging including complex cases like 3D die stacking. It also should be able to handle advanced thermal-management techniques, such as the use of thermal vias.
From the designer’s point of view, it’s necessary to think carefully about the thermal effects of one’s chosen design methodology. Design engineers must work closely with package designers, or from package characteristics, to make sure that they identify and design to realistic, worst-case thermal scenarios. For example, ignoring the impact of low-power design techniques may result in unmanageable on-chip thermal gradients. In contrast, ignoring total power consumption may result in packaging that’s more expensive than initially planned.
Realistically, there are a limited number of controls to manage thermal constraints. They include architectural decisions, the use of multiple operating modes, the physical implementation of the design, and the cooling characteristics of the selected package. All of the decisions that impact these controls should be based on accurate thermal information, which includes the impact of thermal conditions on electrical analysis. Toward the end of a design cycle, the only option to address high thermal issues may be the use of a package with higher cooling characteristics. This proposition is an expensive one.
Finally, designers must understand the cooling complexities of chip stacking, the value of adding thermal vias (via chimneys designed to conduct heat away from a thermal hotspot), and the thermal characteristics associated with different implementation technologies. In the case of silicon-on-insulator (SOI), for example, the substrate has a harder time conducting heat away from the die because the buried oxide layer is a poor thermal conductor.
In summary, development teams that are migrating or planning to migrate into the 45-nm process technology should immediately start to learn about accounting for on-chip thermal effects. Such effects are significantly more pronounced at the 45-nm node. Even teams creating high-performance designs at 65 and 90 nm should account for the thermal issues discussed in this article. After all, these high-power designs are more likely to suffer from high thermal operating conditions and the possibility of correspondingly high thermal gradients.