Optimizing Embedded Memory for the Latest ASIC and SoC Designs
Introduction to Embedded Memory IP
The allocation of physical real estate (die area) of typical large ASIC and SoC designs tends to fall into three general groups:
- Die area dedicated to new custom logic
- Die area dedicated to reusable logic (3rd-party IP or legacy internal IPs)
- Die area used for embedded memory
As Figure 1 shows, companies continue to develop their own key custom blocks that help differentiate their chips in the market (such as wireless DSP+RF for 802.11n, Bluetooth, and other emerging wireless standards), and third-party IPs (such as USB cores, Ethernet cores, and CPU/microcontroller cores) occupy a fairly consistent percentage of die area. Meanwhile, the percentage of area used for embedded memory is increasing dramatically.
|Figure 1: Embedded memory is using a larger percentage of the total available die area in today’s ASIC and SoC designs.|
According to data from Semico Research, in 2013, the majority of SoC ASIC designs allocate over 50% of their die area to various embedded memories. In addition, there is a wide variety in the purpose and ideal characteristics of the many embedded memories in a large SoC, as seen in Figure 2.
|Figure 2: Multiple embedded memory IPs in multicore SoC|
Consequently, it is very important for designers to have access to a variety of memory IP, so they can optimize their design using the right type of memory for each purpose within the SoC. Selecting the proper mix of memory IP types enables designers to optimize performance parameters such as speed, power consumption, area (density), and non-volatility.
Key Design Criteria for Embedded Memories
Figure 3 shows five key driving factors for selecting the best memory IP for each application within the design:
|Figure 3: Key factors for memory IP selection|
The best solution will be some tradeoff between characteristics in each of these dimensions. In many cases, a specific memory IP with an optimized mix of characteristics can be automatically generated by a memory compiler that uses these drivers as input to the memory design generation process. It is also important that the memory IP support infrastructure supports a solid verification methodology, and that the generated IP is optimized for high yield. Finally, to achieve the best productivity and quality, the memory compiler should generate GDSII directly, without manual intervention or tweaking. Other features to look for include well-designed margin controls and support for automatic test pattern generation and built-in self-test (BIST). The ability to single-step BIST for silicon debug is also highly desirable.
Powerful compilers with advanced circuit design can achieve ultra-low dynamic power (CV²f) and minimum leakage power by taking advantage of techniques such as multi-banking, advanced clocking, biasing methodology, controlling the effective channel length (Leff) of transistors, and optimizing across multiple threshold voltages (VTs). Designers can combine these memory techniques with voltage and frequency scaling and multiple power domains to achieve an optimum result.
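To see why the supply voltage dominates the CV²f relationship, the short Python sketch below evaluates dynamic power before and after voltage and frequency scaling. The capacitance, voltage, and frequency values are illustrative assumptions rather than figures from any particular memory compiler, and the activity factor parameter is an added convenience.

```python
def dynamic_power_watts(capacitance_f, voltage_v, frequency_hz, activity=1.0):
    """Switching power P = activity * C * V^2 * f, in watts."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical macro: 20 pF of effective switched capacitance.
nominal = dynamic_power_watts(20e-12, 1.0, 500e6)  # 1.0 V at 500 MHz
scaled = dynamic_power_watts(20e-12, 0.8, 400e6)   # 0.8 V at 400 MHz

# Fraction of nominal dynamic power that remains after scaling.
remaining = scaled / nominal
```

In this hypothetical case, lowering both the supply and the clock by 20% roughly halves dynamic power, because the voltage term is squared.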
It is paramount to leverage advances in design methodology to achieve best-in-class memory performance. Designers need a memory compiler that allows them to trade off speed (for example, access time or cycle time), area, dynamic power, and static power (leakage) to obtain the optimum combination for the desired application. Memory blocks can also benefit from incorporating multiple VTs, multi-banking, and multiple bit cell choices, while complementing these selections with energy-efficient design techniques that also enable high speed.
Reliability and Yield
Drastic decreases in transistor dimensions and supply voltage have significantly reduced noise margins, affecting the reliability of very deep-submicron chips. Error-correcting codes (ECC) and redundancy are needed to improve yield and increase operational reliability.
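As a rough illustration of how ECC recovers from a single flipped bit, here is a minimal Python sketch of the classic Hamming(7,4) code. This is a textbook code chosen for brevity; it is an assumption for illustration only, and production memory ECC typically uses wider codes over full data words.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Fix at most one flipped bit; return (data_bits, error_position)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the bad bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a single-bit upset at position 5
recovered, position = hamming74_correct(stored)
```

The three parity bits are the storage overhead paid for correcting any single-bit error in the seven stored bits.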
Because today’s SoCs have very large aggregate bit counts, embedded memory is the most important determinant of SoC yield. Dedicated test and repair resources are critical to increasing memory yield, reducing time-to-volume and containing test and repair cost. Memory IPs with a built-in self-repair capability based on one-time-programmable memory technology can repair the memory array if bits fail once the chip has been manufactured. Ideally, the repair capabilities of the memory compiler should be tightly integrated with silicon test tools, so that repairs can be programmed immediately during the production testing process.
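The effect of repair on memory yield can be approximated with a simple Poisson defect model. The Python sketch below is a simplification under stated assumptions: defects are random and independent, each spare element repairs exactly one defect, and the defect density and memory area are hypothetical values.

```python
import math


def memory_yield(defect_density_per_cm2, memory_area_cm2, repairable_defects=0):
    """Probability that the array has no more defects than can be repaired."""
    mean_defects = defect_density_per_cm2 * memory_area_cm2
    return sum(
        math.exp(-mean_defects) * mean_defects ** k / math.factorial(k)
        for k in range(repairable_defects + 1)
    )


# Hypothetical: 0.5 defects/cm^2 over 0.5 cm^2 of embedded memory.
without_repair = memory_yield(0.5, 0.5)                     # about 0.78
with_repair = memory_yield(0.5, 0.5, repairable_defects=2)  # about 0.998
```

Under these assumptions, the ability to repair just two defects moves the memory from a major yield limiter to a minor one.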
It is important for designers to have the choice of either using a foundry bit cell or designing their own, depending on the requirements. Where custom design is needed, it is extremely helpful to work with an embedded memory supplier who understands custom design and can provide silicon data for various process nodes. With advanced design techniques, no additional mask and process modification should be needed to achieve the highest yield and reliability.
Having a choice of memory density for various process nodes is an important consideration in the choice of memory IP. Advanced memory compilers allow designers to trade off density against speed, for example, by choosing between high-density (HD) bit cells and high-current (HC) bit cells.
Features such as flexible column multiplexing also enable designers to optimize SoC floor planning by controlling the shape of the memory footprint (variable width, variable height, or perfect square) to minimize memory impact on overall die size. Some memory compilers also support features such as sub-words (bit- and byte-writable) and power mesh generation for optimum power delivery. In addition, flexible ports (one port for read or write, and the second port for read and write) can save area in SRAMs, CAMs, and register files.
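The way column multiplexing reshapes a memory array can be sketched in a few lines of Python. The function name and the 4096 x 32 example are hypothetical; the physical footprint also depends on bit cell dimensions and periphery, which this sketch ignores.

```python
def array_shape(words, bits_per_word, column_mux):
    """Return (rows, columns) of the bit cell array for a given mux factor."""
    if words % column_mux:
        raise ValueError("words must be a multiple of column_mux")
    rows = words // column_mux            # fewer, longer rows as mux grows
    columns = bits_per_word * column_mux  # each row holds column_mux words
    return rows, columns


# The same 4096-word x 32-bit memory at three multiplexing factors.
shapes = {mux: array_shape(4096, 32, mux) for mux in (4, 8, 16)}
```

Here the capacity stays constant at 131,072 bits while the array goes from tall and narrow (1024 x 128) to short and wide (256 x 512), which is the degree of freedom a floor planner uses to fit the macro into the available space.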
Figure 4 illustrates the density relationship between two embedded memory IP architectures. The one-transistor (1T) bit cell provides up to a 50% reduction in die area for a given bit capacity compared to the six-transistor (6T) bit cell. For designs that have moderate speed requirements and a need for high density, the 1T architecture is an ideal choice, and has a beneficial impact on cost, since it can be implemented using a bulk-CMOS process with no additional mask steps. For high-speed applications, designers can use 6T, or even 8T, bit cells to meet their speed requirements.
|Figure 4: Memory density scaling with different embedded memory IP architectures|
For SoC ASICs, designers want to select the IP combinations that achieve “area saving” in comparison with sub-optimal IPs (frequently referred to as “Free IPs”) to achieve the greatest cost saving. While there are many free memory IP options available to designers, these are not always the most economical solutions in terms of overall product profitability. In many cases, the improved density and performance of licensed embedded memory IP can result in significant manufacturing cost reductions that far outweigh the cost savings from “free” memory IP.
Table 1 illustrates how designing with the optimum memory size can affect volume cost over the life of the product. The table uses the percentage of die area occupied by memory IPs, the die cost, the volume run rate, and the product life to calculate the cost saving of higher-density memory. The estimated IP area savings are based on Figure 4, which shows a roughly 2:1 increase in density for the 1T memory versus the 6T options.
|Table 1: Higher Density IP and Cost Savings|
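The kind of calculation behind such a table can be sketched as follows. This Python example assumes that die cost scales linearly with die area, and every input value is hypothetical; none of the numbers are taken from Table 1.

```python
def lifetime_saving(die_cost, memory_area_fraction, memory_area_reduction,
                    units_per_year, product_life_years):
    """Estimated lifetime saving, assuming cost is proportional to die area."""
    die_area_saved = memory_area_fraction * memory_area_reduction
    saving_per_die = die_cost * die_area_saved
    return saving_per_die * units_per_year * product_life_years


# Hypothetical product: $5 die, half of it memory, memory area halved
# (a 2:1 density gain), 2 million units per year for 3 years.
saving = lifetime_saving(
    die_cost=5.0,
    memory_area_fraction=0.5,
    memory_area_reduction=0.5,
    units_per_year=2_000_000,
    product_life_years=3,
)
```

With these inputs the die shrinks by 25%, saving $1.25 per die and $7.5 million over the product life, a figure to weigh against the licensing cost of the denser memory IP.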
What to Look for in Embedded Memory IP
To give you an idea of some of the options you have in your memory design, here is a survey of fee-based embedded memory types with some of the most advanced features available.
Single-port (6T) and dual-port (8T) SRAM IP
Static RAM memory blocks based on traditional 6T storage cells have been the workhorse of ASIC/SoC implementations since these memory structures typically fit right into the mainstream CMOS process flow without requiring any additional process steps. The 6T cells are ideal for large program or data memory blocks, with the best based on production-proven, foundry-provided 6T/8T bit cells for high-speed and low-power designs. The 6T memory cell can be used in memory arrays ranging in capacity from a few bits to multiple megabits.
Memory arrays using this structure can be designed to meet many different performance requirements, depending on whether the designer opts to use a CMOS process optimized for high performance or low power. High-performance processes can yield SRAM blocks that have access times well below 1ns at advanced process nodes such as 40nm and 28nm, while achieving low power consumption. As feature sizes shrink with more advanced process nodes, static RAMs built using the traditional 6T memory cells can deliver even shorter access times with smaller cell sizes.
The static nature of the SRAM memory cell keeps the amount of support circuitry to a minimum, requiring only address decoding and enable signals to drive the decoder, sensing, and timing circuitry.
Single-port (6T) and dual-port (8T) register file IP
These register file memory IPs are a good option for fast processor caches and smaller memory buffers (up to around 72 Kbits per macro). Registers achieve the smallest area with the fastest performance.
Single-layer programmable ROM IP
This configuration is relatively low power and high speed, and is well suited to size-efficient storage of microcode or fixed data, applications that are steadily increasing in number. An IP with support for multiple banks and different aspect ratios enables smaller die size and the best speed. For fast design turnaround time, some offerings provide a programming script language to drive the memory compiler.
Content-Addressable Memory IP
These IPs are typically used as TCAM (ternary) or BCAM (binary) IPs for search engine applications because they are faster, require less power, and use less die area compared to algorithmic approaches for search-intensive applications. Often, searches can be completed in a single clock cycle. TCAM and BCAM are commonly used for packet forwarding, Ethernet address filtering, router lookup, firmware search, host ID search, memory de-duplication, directory compression, packet classification and multi-way cache controllers.
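The matching rule that distinguishes a TCAM from a BCAM can be modeled in a few lines of Python. This is a behavioral sketch only: the entry format and the lowest-index-wins priority are assumptions, and real hardware compares every entry in parallel rather than in a loop.

```python
def tcam_lookup(entries, key):
    """Return the index of the first matching entry, or None.

    Each entry is (value, care_mask): a mask bit of 1 means the key must
    match that bit, and 0 means "don't care". A BCAM is the special case
    where every mask bit is 1.
    """
    for index, (value, care_mask) in enumerate(entries):
        if ((key ^ value) & care_mask) == 0:
            return index
    return None


# Hypothetical 8-bit table, most specific pattern first.
table = [
    (0b10100000, 0b11110000),  # matches 1010xxxx
    (0b10000000, 0b11000000),  # matches 10xxxxxx
    (0b00000000, 0b00000000),  # matches anything (default entry)
]
hits = [tcam_lookup(table, k) for k in (0b10100110, 0b10011111, 0b01000000)]
```

The three keys hit entries 0, 1, and 2 respectively, which mirrors how a router lookup falls back from a specific prefix to a default route.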
Single transistor SRAM
This configuration provides very high density with moderate speed, and is available in 180 nm, 160 nm, 152 nm, 130 nm, 110 nm, 90 nm, and 65 nm processes. It is especially suitable for ASIC/SoC applications that require large amounts of on-chip storage (typically more than 256 Kbits) but do not require the absolute fastest access time, and for designs with limitations in the area or leakage current consumed by the memory blocks. This configuration generates memory arrays that work like SRAMs, but are based on a one-transistor/one-capacitor (1T) memory cell (as used in dynamic RAMs).
Single transistor SRAM arrays can deliver higher capacity in the same chip area as a 6T-based memory array, but require that the system controller and logic be aware of the dynamic nature of the memory and take an active role in providing the refresh control. In some cases, it may be possible to wrap the DRAM array with its own controller to make it appear like a simple-to-use SRAM array. By combining the high-density 1T macro with some support logic that provides the refresh signals, the dynamic nature of the memory cells can be made transparent, and designers can treat the memory block as if it were a static RAM when implementing their ASIC and SoC solutions.
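A quick estimate shows why hidden refresh is usually cheap enough to make transparent. The Python sketch below assumes the wrapper refreshes one row per refresh cycle and that every row must be refreshed once per retention period; the row count, cycle time, and retention time are hypothetical values, not data for any specific 1T macro.

```python
def refresh_overhead(rows, refresh_cycle_ns, retention_time_ms):
    """Fraction of time the array is busy with refresh instead of accesses."""
    busy_ns = rows * refresh_cycle_ns
    retention_ns = retention_time_ms * 1e6
    return busy_ns / retention_ns


# Hypothetical macro: 1024 rows, 5 ns per row refresh, 1 ms retention.
overhead = refresh_overhead(rows=1024, refresh_cycle_ns=5, retention_time_ms=1)
```

With these assumed numbers, refresh occupies about 0.5% of the available cycles, which a wrapper controller can typically hide from the rest of the system.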
1T SRAMs are available from several foundries as licensable IP. However, some of these IPs require extra mask layers (in addition to the standard CMOS layers). This requirement increases the wafer cost and can limit the foundry choices for fabrication. To justify the extra wafer processing cost, the total DRAM array size used in a chip must typically be more than 50% of the die area. Most of the available DRAM macros are hard macros, with limited choice in size, aspect ratio and interfaces.
A special variant of the single transistor SRAM uses an architecture that can be manufactured with standard bulk CMOS processing, and does not require any mask modifications or additional processing steps. Such an IP macro is more cost-effective (saving 15-20% in processing costs), and can be processed in any fab, or transferred from one fab to another for cost or capacity reasons. This solution is available in a variety of sizes, aspect ratios, and interfaces, each of which can be specified to the associated memory compiler. The resulting memory block interface looks almost like static RAM to the rest of the system, but can achieve about two times the density (bits/unit area) vs. memory arrays based on 6T cells (after averaging in the support circuitry overhead as part of the area calculation). For larger memory arrays, the percentage of overall area required by the support circuitry will be less, and the memory block will be even more area-efficient.
Memory Compiler Tools
Tailoring the base IP memory macros to the exact requirements of a specific memory application is the job of the embedded memory compiler. The most flexible compilers allow designers to choose the best architecture and automatically generate memory arrays with the exact combination of speed, density, power, cost, reliability, and size needed to optimize the application. The compiler automation reduces non-recurring engineering costs and potential errors associated with manual array optimization. Compilers enable customers to use the ideal core size, interface, and aspect ratio, while also helping them to achieve the shortest time to market. They also provide designers with electrical, physical, simulation (Verilog), BIST/DFT model, and synthesis views of the memory array as a part of the compilation process.
|Table 2: Commercially Available Examples of Embedded Memory IP|
Selecting the optimal embedded memory IPs for new ASICs and SoCs is a critical design decision. Designers should be aware of all the key dimensions of the best memory characteristics for their specific application, and look for memory IP that provides the flexibility needed to match the range of requirements within the target SoC. While free memory IP is readily available, it does not always provide the best solution when compared to fee-based IP that provides better characteristics for the specific application.
Highly tuned memory IPs with smaller size, lower leakage, lower dynamic power, or faster speed can provide designers with a more optimized solution that can potentially save millions of dollars over the life of the product, and better differentiate their chips in a highly competitive ASIC/SoC marketplace.
Farzad Zarrinfar, Managing Director of the Novelics Business Unit at Mentor Graphics, has over 30 years of industry experience in global semiconductor and IP companies. He was President and CEO of Novelics when it was acquired by Mentor in 2011. Previously, Farzad was vice-president of worldwide sales for strategic accounts at ARC International, and vice president of strategic marketing for SONY Semiconductor’s broadband IC business. He holds a B.S.E.E. from San Diego State University, M.S.E.E. from Southern Methodist University, and an MBA from the University of Phoenix. He may be reached at email@example.com.