Posts Tagged ‘Methods’

Making A Multicore System Work

Thursday, January 29th, 2009

If you think designing a single-core system is hard, designing multicore systems is multiple times harder. Connecting all the pieces together and making them work properly, if not together, is one of the hardest tasks design engineers and architects will ever face.

System-Level Design tracked down some of the experts in this field and sat them down around a table to discuss what’s going on. Included in the discussion were James Aldis, system-on-chip architect for Texas Instruments’ wireless business unit; Charles Janac, president and CEO of Arteris; Drew Wingard, CTO of Sonics; and Dave Gwilt, product manager for ARM interconnect products. What follows are excerpts of that conversation.

SLD: Let’s start with a really basic question. How do you define multicore?

Gwilt: We’ve been doing multiprocessing heterogeneous stuff for a very long time and in many different markets. Multicore is running a single software image across multiple processing elements.

Wingard: That doesn’t match what we see in practical systems.

Aldis: TI has been producing multicore chips for multiple generations now. We split the software into the piece that’s going to run on the RISC and the piece that’s going to run on the DSP and the piece of application processing that’s going to be offloaded onto a hardware accelerator. That’s all a very manual process. When I think of multicore these days I tend to think of what’s coming up in the wireless space where you have a single software image and it’s magically distributed over identical cores on the same device. But multicore means more than that.

Janac: There are a number of people who have tried to do the homogeneous multiprocessor kind of approach, similar to an FPGA. That works in some applications like defense and aerospace and networking, but it doesn’t work in cost-sensitive applications like wireless and consumer. As a result, we wind up with the majority of the market being heterogeneous multiprocessor SoCs. Those are getting increasingly complex because the wireless carriers are constantly trying to deploy new applications and handset guys are trying to approximate the function of a PC. That’s putting increasing pressure on the hardware.

SLD: What do you actually gain integrating multiple cores, which share memory and busses, versus single-core chips?

Wingard: We’re doing these high levels of integration because we’re trying to get a certain amount of function at the lowest system cost and power and with the right amount of performance. We integrate not because we want to, but because Moore’s Law says we have so many transistors. It’s the job of the system architect to figure out how to make it work. In many cases, the thing that throttles these chips is that they have to share memory, but if you don’t share memory you don’t save costs. The personal computer space is driving DRAM road maps to give us increasing bandwidth per pin. Then we want to put the right amount of processing and bandwidth on the SoC so we can maximize utilization of that extra DRAM bandwidth. Some of this is also driven by form factor. You can’t do a multichip iPhone because there isn’t enough space inside the package.

SLD: Is the heterogeneous approach because each function requires different processing power?

Gwilt: Absolutely.

Janac: I was at a presentation where one gentleman said he was proud that his company was only using 7 percent of the ARM processor and that the rest of the system was running on these proprietary algorithmic engines. I wouldn’t be very proud of that.

SLD: So that’s 7 percent utilization?

Janac: Yes. They should be adding some intelligence that makes use of that resource and reduces the cost. One of the issues is how do you route the traffic to the cores that are available. What is the idle core doing? If it is idle, can you utilize it better?

Wingard: Today, in the battery-powered domains, they’re shutting off regions of the chip and turning off the power supply to several of the cores. If they don’t have anything to do, they’re shutting them off.

Janac: Or they’re putting them in a lower operating mode.

Wingard: All these games get played, but there’s an inefficiency associated with that. If you use heterogeneous cores, you can get better results. Your battery lasts longer. You can get higher performance. And you are much more able to support these multi-mode devices, which are still not general-purpose computers. PCs don’t do it this way because economics demand that you have a single software platform and you can run anything you want to pretty well. Application flexibility is much more limited. That doesn’t mean we don’t see clustered processors like the ARM MP core being useful for these applications. It’s still valuable to span a wider range of performance points by using some number of identical cores that you can schedule software across. ARM can scale an application, and the power associated with running that application, when you play with the voltage and the number of cores that are turned on.

Gwilt: That’s the key—using that to get power scaling across a broader dynamic range.

SLD: Didn’t TI do this with its DaVinci platform?

Aldis: Yes, we did. But there’s another aspect to all of this, too. The more open you make your platform, the more you end up in the PC world that Drew described. One thing we’re seeing in the wireless space with the advent of the iPhone and the mobile Internet devices that are coming through now is an emphasis on getting raw power out of the main processor and software portability. The wireless world, particularly at the high end, is becoming more and more like the PC world. This presents a challenge because just throwing gigahertz at something isn’t going to fly in the wireless world because of the constraints of power and form factor.

SLD: More and more, chip developers are trying to get multiple generations out of chips because of the cost of creating one. Is it harder to do with heterogeneous cores?

Janac: No, that’s where the interconnect comes in. If you have the right structure for the interconnect, you’re actually able to add in and back out IP in a much more cost- and time-efficient manner to get multiple derivatives.

Gwilt: That’s absolutely correct. Nowadays, with the type of interconnect technology that’s available, we’re able to build chips with very large numbers of cores and use the content of the cores that we require. We can choose those cores dynamically and maintain a highly optimized solution.

Wingard: There are some interesting examples where they take a subsystem, and within the context of a platform they implement that function in dedicated hardware or an optimized programmable processor. They get to higher performance and lower power that way. But in other versions of the same platform they move that same function into software. From the perspective of the application, the platform is the same. They’ve put in a layer of middleware that allows them to be agnostic. That makes it much easier to take this common platform definition and build different variants.

SLD: Your definition of interconnect is different than the historical one. This version seems to have logic built into it so you can optimize performance in multiple products.

Wingard: We want to put enough intelligence into the interconnect so that some part of the platform definition relies upon logic within the interconnect. What’s different about each chip is the set of IP cores, but there’s a set of common functions that are part of the platform definition. Some of those functions live within the interconnect—things like how do we enforce security and how do we manage to recover from errors. What scares me most about phones becoming more like computers is I really don’t like getting blue screens when I’m in the middle of a call. We expect stability in our appliances.

Gwilt: That same requirement for stability is also being driven by the need for integration. Our customers all want to pull together very significant platforms in very short periods of time. Having the ability to manage that stability through the interconnect is a valuable function.

Janac: If you use the interconnect to assemble these kinds of platform applications, you also need some automated and sophisticated tools for the design of the interconnect and for verification. It’s a matter of both the IP and the tools that come with it that are required for rapid time to market.

Wingard: The total amount of communication that we have to manage in the interconnect grows with the total number of components that have to be connected. But historically the fraction of the chip that’s dedicated to the interconnect and the main memory controllers has been remarkably constant across a wide variety of applications and design styles. Typically, between 8% and 12% of the die are interconnect and memory system components. As the chips get bigger, this is the part of the system that must change for each design. I can mix and match components, but the interconnect is going to be different every time. It is the most chip-specific IP, even in a platform definition. That’s why the automated tooling for this part of the design is so important.

SLD: But interconnects traditionally have been several steps after the initial architectural design. Has that changed?

Aldis: We’re now in our third generation of SoC platforms where we’ve known what our interconnects are going to look like. Maybe not all the dots on the ‘I’s’ and crosses on the ‘T’s’, but we’ve known at a very early stage what we’re going to be using. We also know all the requirements we’re going to put on the different cores in the chip so they can plug into our interconnect environment. Nowadays, when we build a chip the interconnects are enabled before any of the cores. We have legacy cores, of course. But for any new cores, before we have working RTL we have an interconnect. This makes a huge difference in the time between kicking off a project and seeing the test cases running and starting to debug and analyze. We also have a SystemC model for the interconnect technology we’re using. That’s part of the very initial architectural studies.

Wingard: This has a lot to do with the application domain that’s being targeted. In those places where you put multiple cores together, you have to worry about the sharing behavior and performance. You quickly get to the point that until you have a model of that system, and you understand the implications of a shared memory and the interconnect that feeds it, you don’t know if you have an architecture that works. For those domains where it’s not, ‘Slap it together and we don’t care about performance,’ you absolutely have to have the interconnect technology and it has to be available very early in the architectural phase of the chip. There are many designers from an ASIC background who aren’t used to that.

Hardware Prototyping Market Changes Form

Wednesday, December 17th, 2008

By John Blyler

How will the acquisition of ProDesign’s ChipIT business unit expand Synopsys’ market in the system-level rapid prototyping and possibly emulation space?

The short answer is that it’s probably too early to tell. But with the accelerating pace of EDA company consolidations, it’s important to quickly assess the pros and cons of each new acquisition.

Earlier this month, EDA and IP giant Synopsys said it had signed a definitive agreement to acquire the CHIPit business unit of ProDesign Electronic GmbH. ChipIT is a family of hardware-assisted verification and related software tools. This acquisition comes less than a year after Synopsys acquired Synplicity, allowing Synopsys to fully enter the FPGA implementation and debug market.

What does it mean?

With the acquisition of ChipIT and Synplicity, Synopsys now becomes a significant player in the fast-growing hardware-assisted verification market. The company brings additional strengths to this market with its existing verification technology, including IP and RTL simulation tools. At the system level, Synopsys also has a virtual prototyping platform for software development.

Hardware-assisted verification, you ask? Weren’t we talking about the rapid prototyping and perhaps emulation technology? Where does hardware-assisted verification fit into this mix? Although both FPGA-based rapid prototyping and emulation/acceleration are part of the same market segment – namely hardware-assisted verification – these two platforms are targeted at different markets.

“FPGA prototyping is being used mainly for IP verification and software development, where it runs small designs with great performance,” notes Ran Avinun, Group Director of Verification Marketing for Cadence. Conversely, emulation is being used for system and hardware-software verification of larger designs.

So is ChipIT a prototyping or emulation tool? Or perhaps both? Some have suggested that the ChipIT acquisition may be a way for Synopsys to enter the emulation space, through FPGA-based prototyping. Lauro Rizzatti, general manager of Emulation and Verification Engineering (EVE) USA, sees ChipIT as an FPGA-based rapid prototyping tool. “ChipIT is not really in the emulation business. It would take years for them to move from prototyping to emulation. We don’t think that is Synopsys’s intent, either,” explains Lauro.

Further, neither Cadence nor Mentor sees FPGA-based (or hardware) prototypes as a direct threat to or replacement for emulation because of performance and capacity differences. That is why each technology targets a different market, as noted earlier.

If ChipIT is really a prototyping hardware tool, then what unique features does it bring to Synopsys? The answer is transaction-based verification. ChipIT is an ASIC prototyping tool that uses the Standard Co-Emulation Modeling Interface, abbreviated SCE-MI, to perform high-speed, transaction-level verification between different software simulation and hardware emulation systems. SCE-MI is the standard that allows the worlds of simulation, emulation and rapid prototyping to interface.

“You can use C-Level models on the software side that connect to a device-under-test running on a hardware emulator,” explains Juergen Jaeger, Director of Product Marketing for the Synplicity Business Group. For example, C-level models run transaction-level data, such as an Ethernet packet, whereas cycle-accurate hardware emulations run the actual data bits within the Ethernet packet. Being able to run both levels of models enables system-level co-verification of a design.
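The difference between the two levels can be sketched in a few lines of Python. This is only an illustration of the idea of a transactor, not the SCE-MI API: the `EthernetFrame` class, the `frame_to_beats` and `beats_to_frame` functions, and the 4-byte bus width are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EthernetFrame:
    """One transaction: the whole frame is handed over in a single call."""
    dest: bytes     # 6-byte destination MAC address
    src: bytes      # 6-byte source MAC address
    payload: bytes


def frame_to_beats(frame, bus_width_bytes=4):
    """Hypothetical transactor: expand one transaction into per-cycle bus beats."""
    raw = frame.dest + frame.src + frame.payload
    for offset in range(0, len(raw), bus_width_bytes):
        chunk = raw[offset:offset + bus_width_bytes]
        last = offset + bus_width_bytes >= len(raw)  # marks the final beat
        yield chunk, last


def beats_to_frame(beats):
    """Reverse direction: reassemble per-cycle beats into one transaction."""
    raw = b"".join(chunk for chunk, _ in beats)
    return EthernetFrame(dest=raw[:6], src=raw[6:12], payload=raw[12:])


frame = EthernetFrame(b"\x01" * 6, b"\x02" * 6, b"hello, DUT")
beats = list(frame_to_beats(frame))
print(len(beats), "bus cycles carry 1 transaction")
```

The software side only ever sees `EthernetFrame` objects, while the hardware side sees one beat per clock. Keeping the conversion in one place is what lets the two sides run at different levels of detail.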

Questions Remain

Synopsys’ acquisition of ChipIT would seem to strengthen its position in the system-level development market. Yet many questions remain. First and foremost is how Synopsys will integrate its most recent acquisitions, Synplicity and ProDesign’s ChipIT. For example, which of the two hardware platforms – Synplicity’s Hardi or ProDesign’s ChipIT – will it support, merge or remove? A similar question might be asked on the software side – Synplicity’s Confirma or ProDesign’s ChipIT?

Will ChipIT, a transaction-level tool, be used in conjunction with Synopsys’ virtual prototyping platform? Will this mean that Synopsys can now add behavioral synthesis and hardware-software partitioning to its existing RTL-based products? Behavioral synthesis is a prerequisite for many system-level architecture activities.

When asked these questions, Juergen was careful to point out that Synopsys’ acquisition of ProDesign’s ChipIT was done to complement the earlier Synplicity purchase, not to overlap it. He was also quick to add that more news would be forthcoming, once the actual acquisition of ChipIT was complete.

Though it is too early to tell, this acquisition looks to have long-term implications for the chip design market.

Houston…We Have A System-Level Problem

Thursday, December 4th, 2008

Just imagine what happens when the guidance system on the International Space Station goes on the fritz and the entire lab begins doing somersaults through outer space. Then the solar panels no longer work and the communication system fails, and suddenly you understand how serious system-level design problems can become. Ret. Capt. Daniel Bursch recounts the incident from the safety of the Naval Postgraduate School in Monterey, Calif.

Smarter Robots

Thursday, October 30th, 2008

At the Naval Postgraduate School in Monterey, Calif., the goal of these engineering students and professors is to create autonomous robot designs, ones that can be preprogrammed so that nothing can interfere with their designed-in purpose. Using Lego MindStorm components, all the normal EDA tools, an ARM processor and Actel FPGAs, the goal is to create prototypes for battlefield-hardened devices. The results in this video are primitive, but it’s a first step in one of the most complex system-level designs.


Hardware/Software Validation

Thursday, October 23rd, 2008

INTRODUCTION

In today’s competitive consumer electronics, missing a market window by even a few weeks can result in drastically limited sales. These cost and schedule-sensitive applications, however, are among the most challenging to create. Composed of many complex hardware blocks they typically include sophisticated digital circuitry coupled with large memories to provide advanced computational and multimedia capabilities. And being battery powered, they have stringent power restrictions despite the fact that each generation supports ever more features and capabilities.

With all the complexity associated with the hardware, the software is also crucial to the competitive success of these products. The application software often is the key differentiator for these consumer products, allowing the system company to reap substantial profit margins. Software is also key in the power and performance behavior of the hardware platform.

With traditional product development flows, the software team waits to validate their code on prototype hardware. While this approach worked well in the past, it fails under current technical and time-to-market pressures. According to industry research firm Venture Development Corporation, nearly 40 percent of project delays can be traced back to flaws in the system architecture design and specification. This problem exists because finding and fixing hardware/software design errors at the late, physical prototype stage is so difficult and time consuming.

Moving hardware/software validation earlier in the design flow enables both hardware designers and software developers to quickly model their designs, assess the functionality and design attributes of the entire system, and easily make changes that can pay huge performance, power consumption and system size dividends without endangering time-to-market deadlines. The conclusion is clear: starting application software and firmware development against a high-level hardware model can save significant development time, and yield products that meet or exceed consumer expectations.

CONDUCTING SOFTWARE VALIDATION EARLIER IN THE DESIGN CYCLE

A new system design methodology is emerging in response to this pressing need for earlier hardware/software validation. The approach is based on the creation of high-level hardware models that describe functionality in sufficient detail for the software team to use as a development platform at the earliest stages of hardware design. As a result, software developers can start their application and firmware validation from the initial stages of the design cycle, where changes are easiest and have the most impact on final design characteristics, and there is little risk of missing a market deadline.

The methodology is based on a scalable transaction-level modeling (TLM) concept that describes the hardware in SystemC. A scalable TLM approach provides benefits to both hardware and software development. Not only can the software team begin coding much earlier in the design cycle, but TLM hardware descriptions also simulate much faster – 100x or more – making TLM a viable solution for software development and validation.

On the hardware side, TLM allows for compact descriptions because the hardware system blocks are captured at a higher level and communicate by function calls, not by detailed signals, significantly reducing simulation time. The TLM model does not limit the design creativity of the hardware team. TLM also allows functionality to be separated from implementation. Hence, instead of forcing the hardware team to commit to hardware specifics early in the design cycle, the model simply describes the functionality of the hardware, not the details of how the hardware achieves that functionality. It also enables incremental model fidelity for timing and power. In essence, the TLM model is independent of the hardware mechanics, allowing the hardware team to continually refine the design without having to constantly update the high-level virtual prototype.
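To make "communicate by function calls, not by detailed signals" concrete, here is a minimal sketch in Python rather than SystemC. The class names `MemoryModel` and `DmaModel` and their methods are hypothetical, chosen only for illustration. They describe what the blocks do and say nothing about pins, clocks or bus protocol.

```python
class MemoryModel:
    """Transaction-level memory: functionality only, no pins or clocks."""

    def __init__(self, size_bytes):
        self._data = bytearray(size_bytes)

    def write(self, addr, data):
        self._data[addr:addr + len(data)] = data

    def read(self, addr, length):
        return bytes(self._data[addr:addr + length])


class DmaModel:
    """Initiator block that reaches the memory through plain method calls."""

    def __init__(self, memory):
        self._memory = memory

    def copy(self, src, dst, length):
        # One call moves the whole block. A signal-level model would instead
        # drive address, data and strobe wires on every clock cycle.
        self._memory.write(dst, self._memory.read(src, length))


mem = MemoryModel(64)
mem.write(0, b"TLM")
DmaModel(mem).copy(src=0, dst=16, length=3)
print(mem.read(16, 3))
```

Because `DmaModel` depends only on the `read` and `write` calls, the memory's internals can be refined later without touching the blocks that use it.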

At the same time, software development can align with the hardware development from the very earliest stages of the design cycle, allowing system interaction issues to be identified and resolved from the outset, dramatically minimizing the impact on the design schedule.

As a result, this methodology moves software/hardware integration into the electronic system level (ESL).

USING PROGRAMMER’S VIEW FOR SOFTWARE APPLICATION VALIDATION

TLM allows several abstraction levels, all of which support virtual prototyping and hardware/software validation. However, there are tradeoffs between TLM’s multiple abstraction levels. The very highest level of TLM, known as the “Programmer’s View” (PV) level, is a good stage to begin software validation. At this stage, the SystemC hardware description does not include any timing information and therefore the simulation performance is extremely efficient, at least 1,000 times faster than at the RTL level. The TLM model contains sufficient information about the hardware functionality to support software application development.

Interface declarations are included so the software can connect with the hardware side. Specifically, there are two kinds of interfaces. The first is a high-level method interface that the software engineer can call from his program; the method “runs” the hardware design and “returns” the result value. The second is a bus-cycle-accurate interface based on memory-mapped registers on the hardware side, allowing the hardware and software sides to interact through read and write transactions along with interrupts. Such a hardware/software interface is achieved either by incorporating an ISS (Instruction Set Simulator) or by using a host-mode technology that relies on implicit read/write access. An implicit access “captures” all the accesses to hardware by identifying the memory space calls. It allows software to run on any host processor (rather than the target processor) and simplifies the software programming, since the software engineer does not need to instrument the code with any external API calls. Host-mode execution often offers much faster simulation, with slightly less accuracy, than the traditional ISS.
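The memory-mapped side of this can be sketched as a simple address decoder in Python. Everything here is an assumption made for illustration: the `TimerModel` peripheral, its register offsets and the base address `0x4000_0000` are all invented. A real host-mode tool would trap the software's native loads and stores, whereas this sketch calls the decoder explicitly to show how an access is matched to a device by its address window.

```python
class TimerModel:
    """Hypothetical peripheral with two memory-mapped registers."""
    CTRL, COUNT = 0x0, 0x4   # register offsets (assumed for this example)

    def __init__(self):
        self.regs = {self.CTRL: 0, self.COUNT: 0}

    def write(self, offset, value):
        self.regs[offset] = value

    def read(self, offset):
        return self.regs[offset]


class AddressDecoder:
    """Routes any access that falls inside a device's window to its model."""

    def __init__(self):
        self._windows = []   # list of (base, size, device)

    def map(self, base, size, device):
        self._windows.append((base, size, device))

    def _lookup(self, addr):
        for base, size, device in self._windows:
            if base <= addr < base + size:
                return device, addr - base
        raise ValueError(f"no device mapped at {addr:#x}")

    def write(self, addr, value):
        device, offset = self._lookup(addr)
        device.write(offset, value)

    def read(self, addr):
        device, offset = self._lookup(addr)
        return device.read(offset)


TIMER_BASE = 0x4000_0000     # assumed base address
bus = AddressDecoder()
timer = TimerModel()
bus.map(TIMER_BASE, 0x8, timer)

# "Driver" code only knows addresses, exactly as it would on the target.
bus.write(TIMER_BASE + TimerModel.COUNT, 1000)
bus.write(TIMER_BASE + TimerModel.CTRL, 1)
print(hex(TIMER_BASE), bus.read(TIMER_BASE + TimerModel.COUNT))
```

The driver code contains no model-specific API calls, only addresses and values, which is the property the implicit-access approach aims to preserve.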

FIRMWARE DEVELOPMENT ENVIRONMENT

Traditionally software teams were forced to wait for a hardware prototype to develop the firmware because of the level of detail required for validation. However, using the TLM models this level of hardware/software interaction can now be moved up much earlier in the design cycle. At this point, the hardware team should “add” detailed timing information, since the behavior of the firmware can be influenced by the timing of the system.

Firmware development requires a more accurate and detailed description of the hardware, including timing information (in addition to the functionality description). Therefore the abstraction level is now bus-cycle-accurate. At that level software engineers can decide if they want to work on the target OS (in this case they will use ISS models accompanied by the SW development tools) or on any host OS of their choice, in which case they will use bus-functional models and implicit-access functionality.
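One way to picture the step from an untimed model to a timed one is a single functional model with optional latency annotations. The Python sketch below assumes a hypothetical `TimedMemory` class and made-up latencies of 3 cycles per read and 2 per write; the functionality is identical in both configurations and only the cycle accounting differs.

```python
class TimedMemory:
    """Same functionality at two fidelities: untimed (PV) or cycle-annotated."""

    def __init__(self, size_bytes, read_cycles=0, write_cycles=0):
        self._data = bytearray(size_bytes)
        self._read_cycles = read_cycles      # assumed latency per read
        self._write_cycles = write_cycles    # assumed latency per write
        self.elapsed_cycles = 0

    def write(self, addr, value):
        self._data[addr] = value
        self.elapsed_cycles += self._write_cycles

    def read(self, addr):
        self.elapsed_cycles += self._read_cycles
        return self._data[addr]


def firmware_poll(mem, flag_addr, max_polls):
    """Firmware-style loop: poll a status flag until it is set."""
    for _ in range(max_polls):
        if mem.read(flag_addr):
            return True
    return False


pv = TimedMemory(16)                                    # untimed model
timed = TimedMemory(16, read_cycles=3, write_cycles=2)  # timing added later
for model in (pv, timed):
    model.write(0, 1)
    firmware_poll(model, flag_addr=0, max_polls=10)
    print(model.elapsed_cycles)
```

The same `firmware_poll` routine runs unchanged against both models. Only the timed one reports how many bus cycles the polling cost, which is the kind of information firmware validation needs.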

This enables the firmware code to interact through bus-functional models with the hardware design. Working in a host operating system environment of choice (as described above) using the cycle-accurate model, any read/write operation will be mapped to the hardware and interact with an actual address in the hardware. An example of this type of implicit access is:

There are several specific debugging functionalities for firmware-related verification tasks. For instance, the design team can manage both hardware and software environments in one IDE tool. They can also perform debugging operations, such as assigning breakpoints, on both sides and perform hardware/software transaction debugging. And they can view all the transactions (read/write/interrupts) and associated information between hardware and software, and break on any specific type of transaction or its parameters.

SELECTING THE RIGHT HW VERIFICATION METHODS LINKED WITH SW

When it comes to HW verification and debug, there are two usual approaches. The first approach involves the use of ISS models and software development environments at the highest TLM level (fast ISS models) or at the cycle-accurate level, as described in the previous sections. The second approach is emulation of software threads within the SystemC hardware design. As opposed to the previous methods, where SW is linked through an ISS or host mode, with this method SW is embedded within the HW CPU model as additional SystemC threads that execute directly with the HW in one simulation process. This is used specifically for system performance exploration, since it offers very high simulation speed while being less accurate, with no support for an RTOS. In that approach, which is used mainly by system architects, it is also possible to use “token-based” modeling, which allows high simulation performance.

In the first approach, the PV and the cycle-accurate models can also interact with SystemC verification solutions. They can be connected to existing ISS SystemC models, either at the PV level or as cycle-accurate ISS solutions at the “Verification view” level. Software developers can work on the real target operating system if the host mode is not accurate enough for them. If the ISS model(s) and associated software development tools can be fully synchronized with the SystemC hardware description of the system, the target software development can also start earlier in the design cycle.

In the second approach, we define a sub-level of abstraction called the “Architects view”, which includes some timing information and simulates faster than cycle-accurate models, but is not as accurate as they are. This level is mainly used by system architects for performance analysis. Here, the methodology includes a set of configurable hardware models at that abstraction level: generic buses, generic processor, generic DMA, data generators, etc. Using this methodology, the system architect can define hardware and software partitioning as well as target processors, bus architectures and memory hierarchies. Equally important, the system architect can add in timing and power metrics. It also supports token-based modeling, an abstract high-level modeling method that uses “tokens” (pointers) to represent the data structure, resulting in exceptionally fast simulation performance, an important requirement for system performance analysis.
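Token-based modeling can be illustrated with a small Python sketch. The `Token` and `GenericBus` classes, the two-cycle setup cost and the workload sizes are all assumptions invented for this example. The point is that the model moves only a token that records how big the data is, never the data itself, so exploring several bus widths costs almost nothing.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Stands in for a data buffer: carries its size, never the payload."""
    name: str
    size_bytes: int


class GenericBus:
    """Hypothetical configurable bus model for architectural exploration."""

    def __init__(self, width_bytes, setup_cycles):
        self.width_bytes = width_bytes
        self.setup_cycles = setup_cycles   # assumed fixed cost per transfer
        self.busy_cycles = 0

    def transfer(self, token):
        cycles = self.setup_cycles + math.ceil(token.size_bytes / self.width_bytes)
        self.busy_cycles += cycles
        return cycles


def total_cycles(width_bytes, tokens):
    """Run one workload through a bus of the given width."""
    bus = GenericBus(width_bytes=width_bytes, setup_cycles=2)
    for token in tokens:
        bus.transfer(token)
    return bus.busy_cycles


workload = [Token("video_frame", 4096), Token("audio_block", 512)]
for width in (4, 8, 16):
    print(f"{width}-byte bus: {total_cycles(width, workload)} cycles")
```

An architect would sweep parameters such as `width_bytes` in this way and compare the resulting cycle counts against the performance target before committing to a bus architecture.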

In addition, performance analysis functionalities can be used with custom models, so that system architects can run software “emulation” as a testbench for their system performance analysis task. Think of it as a software emulation that runs as SystemC threads and is therefore part of the hardware simulation, but runs extremely fast. This capability can be used by the system architect at the highest level to find the best architecture to meet the design requirements. The tokens or pointers result in very high accuracy modeling for measuring the performance of the system. The system engineer can manipulate the parameters of the different blocks and test various configurations and use cases until reaching the required performance.

INTEGRATING SOFTWARE AND HARDWARE DEVELOPMENT

In markets extremely sensitive to cost and schedule slips, such as consumer electronics, hardware and software teams need to work together from the very outset to meet market windows. The emerging scalable TLM methodology described above moves software and firmware validation to the earliest stages in the design cycle, benefiting both teams. Software designers can now validate their applications and firmware long before a hardware prototype is available. At the same time, the hardware team can concentrate on hardware development refinement without having to continually update models for the software validation.

By aligning the software and hardware flows at the earliest point possible, this approach minimizes integration risks downstream in the design flow. The result is significantly reduced chance of schedule slips even as the design team maximizes their product’s differentiation. The use of scalable TLM models is a crucial step in bridging software and hardware design methodologies, bringing them closer together towards the ultimate goal of true concurrent design.

Authors:

Alon Wintergreen
Corporate Applications Engineer
alon_wintergreen@mentor.com

Rami Rachamim
Product Marketing Manager
rami_rachamim@mentor.com

Verifying ASICs with FPGA Arrays

Thursday, October 16th, 2008

Memory Design Considerations When Migrating to DDR3 Interfaces from DDR2

Tuesday, September 23rd, 2008

Introduction

This white paper provides the reader with a detailed understanding of the key design considerations when migrating to a DDR3 system interface from a DDR2 interface and reviews the new DDR3 features, comparing and contrasting them to previous features available in the DDR2 specification. The biggest changes are the tightened timing requirements in the Physical Layer (PHY) portion of the memory interface. These changes are highlighted and illustrated with an example design of a high performance processor interface. The areas where backwards compatibility should be maintained are also illustrated with an example design, showing how simple changes can provide significant benefits in reuse and system flexibility.

A Comparison of DDR2 and DDR3 Memory Standards

The DDR2 memory standard is being upgraded with the advent of the DDR3 standard. The variety of memory devices available today provides the system architect with multiple options when selecting a memory. Before going into the detailed comparison of DDR2 and DDR3, let’s review the key features of a typical DDR2 memory subsystem and the associated memory controller. This will serve as a baseline for the detailed comparison.

DDR2 Description

A typical DDR2 memory subsystem uses a DIMM (Dual In-line Memory Module) to house multiple DDR2 memory devices. A typical DDR2 DIMM architecture is illustrated in Figure 1 below. The control and address signals come onto the DIMM and are routed to the memory devices in a T-branch topology. This architecture balances the delay to each memory device, but introduces additional skew due to the multiple stubs and the different stub lengths for each signal.

Figure 1. DDR2 Dual In-line Memory Module Architecture

A DDR2 memory controller is located on the chip driving the DIMM. A typical DDR2 memory controller is shown in the block diagram in Figure 2. The PHY (Physical Layer) sub-system is responsible for the physical interface between the DDR DRAM (Double Data Rate Dynamic Random Access Memory) and the rest of the system. Timing is controlled precisely to ensure data is captured or presented in just the right relationship with the DRAM clocking signals. Data read from the DRAM is optionally corrected by the ECC (Error Checking and Correcting) block and provided to the pending write FIFO (First In First Out). If ECC is being used, the ECC check bits are computed prior to the write to memory by another optional ECC block in the write path.

Figure 2. DDR2 Functional Block Diagram

The scheduler (made up of the NT – Next Transaction, BSM – Bank State Machine, OS – Operation Selection, and GS – General State machine blocks in the middle right of Figure 2) prioritizes the current list of commands, determines which command is the most urgent, and then issues that command to the DRAM. Data is read from or written to the memory based on the scheduler’s computation of access priority. The scheduler constantly works towards the goal of maximizing overall system efficiency and bandwidth while issuing all high priority commands as quickly as possible.

Commands are optionally pipelined and added to the pending FIFO. If the command is most urgent (direct read) it bypasses the pending FIFO and is issued directly to the memory. Regular priority accesses make their way through the pending read FIFO or the read token FIFO for command completion.
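The queueing behavior described above can be sketched in a few lines of Python. This is a toy model only, not the controller's actual logic: the class names, the numeric priority convention and the `DIRECT` level are assumptions made for illustration, and a real scheduler would also weigh bank state and timing constraints.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Command:
    priority: int                    # lower value = more urgent (assumed convention)
    seq: int                         # arrival order, breaks ties first-come first-served
    op: str = field(compare=False)   # "read" or "write"
    addr: int = field(compare=False)


class Scheduler:
    """Toy model: urgent commands bypass the pending FIFO, the rest are queued."""

    DIRECT = 0  # hypothetical priority level meaning "issue immediately"

    def __init__(self):
        self._pending = []           # heap ordered by (priority, seq)
        self._seq = count()
        self.issued = []             # commands sent to the DRAM, in issue order

    def submit(self, op, addr, priority):
        cmd = Command(priority, next(self._seq), op, addr)
        if priority == self.DIRECT:
            self.issued.append(cmd)  # bypass path for the most urgent commands
        else:
            heapq.heappush(self._pending, cmd)

    def issue_next(self):
        """Issue the most urgent pending command, if any."""
        if not self._pending:
            return None
        cmd = heapq.heappop(self._pending)
        self.issued.append(cmd)
        return cmd


def demo_scheduler():
    s = Scheduler()
    s.submit("write", 0x100, priority=2)
    s.submit("read", 0x200, priority=1)
    s.submit("read", 0x300, priority=Scheduler.DIRECT)  # bypasses the queue
    while s.issue_next():
        pass
    return [(c.op, c.addr) for c in s.issued]
```

In `demo_scheduler()` the direct read at 0x300 is issued first even though it arrived last, followed by the queued commands in priority order.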

DDR3 Description

The main thrust of the DDR3 memory standard is to increase memory bandwidth while making it relatively easy for the designer to take advantage of this bandwidth increase. Innovations in the PHY portion of the DDR3 interface support this increase in bandwidth. The PHY innovations include Read and Write Leveling capabilities which allow for independent timing adjustments for the Read and Write paths. Innovations outside the PHY also help improve overall performance and reliability of DDR3 designs. These changes include a Reset Pin to ensure the proper initialization of the memory devices, an increase in the pre-fetch size to 8 bits from the 4 bits used previously in DDR2, and a ZQ calibration feature to simplify the calibration adjustment process. Each of these innovations will be explained in more detail in the following sections. We will start with a description of the DDR3 leveling features and then move on to the other DDR3 features.

DDR3 Leveling Features

The DDR3 specification can support a fly-by architecture either on a memory module or on a board. In this architecture, illustrated in Figure 3 below, the signals from the memory controller are connected in series to each memory component — in effect flying by each component instead of stopping there as in the DDR2 implementation shown in Figure 1. In a DDR3 memory module, the signals from the DDR3 PHY come into the middle of the module and connect to each memory chip sequentially. This reduces the number of stubs and the stub lengths. Termination is placed just at the end of the signal. This improves the signal characteristics over the traditional DDR2 topology.

Figure 3. Fly-by Topology for DDR3 Un-buffered DIMM

The drawback to this approach is that the delay from the PHY output signals to each memory is slightly different, depending on where the memory chip is in the sequence. This delay difference must be compensated for by the DDR3 PHY, which uses the new leveling feature required by the DDR3 specification. Write Leveling and Read Leveling use different techniques.

Write Leveling

During Write Leveling, the memory controller needs to compensate for the additional flight time skew (difference in the signal delay to each memory device) introduced by the fly-by topology with respect to strobe and clock. In particular, the tDQSS, tDSS and tDSH timing requirements (those related to skew between the data strobe and clock) would be very difficult to meet. These timing parameters can be met by using a programmable delay element on DQS with fine enough granularity so the proper delay can be inserted to compensate for the additional skew delay. Figure 4 shows the needed timing relationship.

The source CK and DQS signals are delayed in getting to the destination, as illustrated by arrow #1 and arrow #2 respectively. This delay can be different for each memory component on the memory module and will be adjusted on a chip-by-chip basis and even on a byte basis if the chip has more than one byte of data. The diagram illustrates just one instance of a memory component. The memory controller repeatedly delays DQS, a step at a time, until a transition from a zero to a one is detected on the destination CK signal. This will realign DQS and CK so that the destination data on the DQ bus can be captured reliably. Because all this is done automatically by the controller, the board designer need not worry about the details of the implementation. The designer benefits from the additional margin created by the Write Leveling feature in the DDR3 memory controller.

Figure 4. Timing Diagram for Write Leveling
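The step-at-a-time search described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: CK is modeled as an ideal 50% duty-cycle square wave, the flight times, the 10 ps delay step and the 1250 ps clock period are illustrative values, and the function names are hypothetical rather than taken from any controller.

```python
def ck_level(t_ps, ck_flight_ps, period_ps):
    """Level of CK seen at the DRAM at time t_ps.

    CK is modeled as an ideal square wave whose rising edge reaches the
    DRAM ck_flight_ps after leaving the controller.
    """
    phase = (t_ps - ck_flight_ps) % period_ps
    return 1 if phase < period_ps / 2 else 0


def write_level(ck_flight_ps, dqs_flight_ps, period_ps=1250, step_ps=10, max_taps=256):
    """Return the number of delay taps that aligns DQS with a CK rising edge.

    DQS is delayed one tap at a time; at each setting CK is sampled at the
    DQS rising edge. The search stops at the first 0-to-1 transition in the
    sampled value. Returns None if no transition is found.
    """
    prev = None
    for taps in range(max_taps):
        t_edge = dqs_flight_ps + taps * step_ps   # arrival time of the DQS edge
        sample = ck_level(t_edge, ck_flight_ps, period_ps)
        if prev == 0 and sample == 1:
            return taps
        prev = sample
    return None


# One result per memory device: devices further along the fly-by chain see
# CK later and therefore need more DQS delay.
taps_per_device = [write_level(ck, dqs_flight_ps=100) for ck in (200, 400, 600)]
```

With these assumed numbers the required delay grows with each device's position on the chain, which is why the adjustment is made per chip (or per byte lane) rather than once for the whole module.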

Read Leveling

During Read Leveling, the memory controller adjusts for the delays introduced by the fly-by memory topology that impact the read cycle. This is done via the addition of a special Multi-Purpose Register (MPR) in the DDR3 memory device. The MPR can be loaded with predefined data values via a special command from the memory controller. These data values can be used for system timing calibration by the memory controller.

As shown in Figure 5, the MPR can be selected by setting a bit in another memory register (EMRS3, bit A2) to switch the source of data for memory reads to the MPR instead of the normal memory array. The MPR data is substituted for the DQ, DM, DQS and /DQS pads on the memory device. This feature allows the memory controller to calibrate the timing of the read path to adjust for any additional delays introduced by the DDR3 fly-by architecture. Delays will be computed and inserted in the appropriate signals inside the controller to adjust for these additional read path requirements.

Figure 5. Read Leveling Using MPR
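One common way to turn a known read pattern into a delay setting is to sweep the capture delay and pick the middle of the range that reads back correctly. The sketch below illustrates that idea only; the paper does not say how the controller computes its delays, so the sweep, the pattern values and the `make_lane` stand-in for the hardware are all assumptions.

```python
# Assumed predefined pattern; the paper does not list the MPR data values.
MPR_PATTERN = [0, 1, 0, 1, 0, 1, 0, 1]


def make_lane(window_start, window_end):
    """Hypothetical stand-in for reading the MPR through one byte lane.

    The read returns the pattern only when the capture delay (in taps)
    falls inside the lane's valid window, which the controller cannot see.
    """
    def read_mpr(taps):
        if window_start <= taps < window_end:
            return list(MPR_PATTERN)
        return [0] * len(MPR_PATTERN)   # model a mis-capture as corrupted data
    return read_mpr


def read_level(read_mpr, max_taps=64):
    """Sweep the capture delay; return the center of the longest passing run."""
    best_start, best_len = None, 0
    run_start, run_len = None, 0
    for taps in range(max_taps):
        if read_mpr(taps) == MPR_PATTERN:
            if run_len == 0:
                run_start = taps
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    if best_len == 0:
        return None                      # no setting read the pattern correctly
    return best_start + best_len // 2


lane_delay = read_level(make_lane(10, 30))
```

Centering the capture point in the passing window leaves margin on both sides, and no extra calibration pins are needed because the reference data comes from the memory device itself.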

Additional DDR3 Features

DDR3 has additional features to improve performance and reliability. These include a Reset Pin, an 8-bit pre-fetch, and ZQ calibration. A new Reset Pin is used to clear all state information in the DDR3 memory device without the need to individually reset each control register or power down the device. This saves time and power when bringing the device to a known state. The 8-bit pre-fetch is used in conjunction with burst lengths of 4 or 8. This improves performance for sequential accesses. The new ZQ calibration feature allows the memory device to take a longer time for calibration at start-up and a shorter time during periodic calibration activities. Table 1 below shows a feature-by-feature comparison of DDR, DDR2 and DDR3 memory devices.

Feature | DDR | DDR2 | DDR3
Data Rate | 200-400 Mbps | 400-800 Mbps | 800-1600 Mbps
Interface | SSTL_2 | SSTL_18 | SSTL_15
Source Sync | Bi-directional DQS (single-ended default) | Bi-directional DQS (single/diff option) | Bi-directional DQS (differential default)
Burst Length | BL = 2, 4, 8 (2-bit pre-fetch) | BL = 4, 8 (4-bit pre-fetch) | BL = 4, 8 (8-bit pre-fetch)
CL/tRCD/tRP | 15 ns each | 15 ns each | 12 ns each
Reset | No | No | Yes
ODT | No | Yes | Yes
Driver Calibration | No | Off-chip | On-chip with ZQ pin
Leveling | No | No | Yes

Table 1. DDR, DDR2 and DDR3 Feature Comparison
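The data rate and pre-fetch rows of Table 1 are related by simple arithmetic, worked through below. The per-pin rates and pre-fetch depths come from the table; the 64-bit bus width is an assumption chosen as a typical DIMM data width, and the function names are illustrative.

```python
# Top per-pin data rates and pre-fetch depths, taken from Table 1.
TABLE_1 = {
    "DDR":  {"max_rate_mbps": 400,  "prefetch_bits": 2},
    "DDR2": {"max_rate_mbps": 800,  "prefetch_bits": 4},
    "DDR3": {"max_rate_mbps": 1600, "prefetch_bits": 8},
}


def core_rate_mhz(gen):
    """Rate at which the memory array must supply one pre-fetch per pin."""
    row = TABLE_1[gen]
    return row["max_rate_mbps"] / row["prefetch_bits"]


def peak_bandwidth_mb_s(gen, bus_width_bits=64):
    """Peak transfer rate in MB/s; the 64-bit default width is an assumption."""
    return TABLE_1[gen]["max_rate_mbps"] * bus_width_bits / 8


core_rates = {gen: core_rate_mhz(gen) for gen in TABLE_1}
bandwidths = {gen: peak_bandwidth_mb_s(gen) for gen in TABLE_1}
```

At the top speed of each generation the array access rate works out to the same 200 MHz, which shows that the doubling of interface bandwidth comes from the deeper pre-fetch rather than from a faster memory array.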

Planning For Migration — An Example Design

In order to explore how to prepare a DDR2 design for migration to a DDR3 design, it will help to establish an example system. Let’s assume that the system will require a DIMM interface for DDR2 and will need to use a similar type of memory module in the DDR3 implementation. Performance is increasingly important for many applications so the decision is to initially design the controller as a DDR2 design, but to allow future migration to DDR3. As much as possible, we want to make it easy to modify the board and the memory controller to migrate from the DDR2 implementation to a DDR3 implementation.

Board Level Issues

One of the biggest issues when thinking of migrating from DDR2 to DDR3 is that the DDR2 and DDR3 DIMMs have different pin-outs and sizes. This means that it will be very difficult, at the board level, to create a single DIMM receptacle and then be able to plug in either a DDR2 or DDR3 memory module. The best approach is to take into account the key board level differences between the two standards, and by planning ahead make it easier to implement changes to the board (modify the DIMM footprint and change some traces) for a DDR3 module. For example, it should be possible to plan for the inclusion of the DQS change and the Reset Pin ahead of time, in order to make it easier to add the DDR3 DIMM footprint and re-layout the board.

DQS

In DDR3, DQS is specified as differential while in DDR2 it can be single ended or optionally differential. Clearly, if the differential version is used in DDR2 it will make the transition to DDR3 easier. This may require additional pins in the memory controller, but if upward compatibility is important the extra pins will be worth it. The DDR2 implementation will also be more robust.

In DDR3, the DQS is sourced by each memory device in order to account for the additional delay from the fly-by topology. The number of DQS signals is therefore larger in the DDR3 implementation than in the DDR2 version. Again, if the additional pins are not a concern, implementing the additional DQS signals in the DDR2 design will help with the migration to DDR3.

Reset Pin

The Reset Pin present in DDR3 is easy to add to DDR2. Although the pin will not do anything in the DDR2 implementation, including it will ensure that the pin is available when it is time to migrate to DDR3.

Memory Controller Issues

Other aspects of the DDR2 to DDR3 migration will affect the memory controller. If the DDR2 memory controller is designed with these issues in mind, the process can be simplified considerably. Some of the most important issues are the Output Drivers, DLLs (Delay Locked Loops, the key building blocks for adjusting timing for critical periodic signals) for Write Launch, and Read Leveling.

Output Drivers

The DDR2 standard calls for 1.8V SSTL I/Os. DDR3 calls for 1.5V SSTL I/Os. It may be difficult to find an I/O buffer that can support both standards. It might require a programmable I/O, similar to those found on FPGAs (Field Programmable Gate Arrays), to support both standards. A change in I/O buffers would require a spin of the chip driving the DDR3 memory, but perhaps a metal mask option could be used to make this change less expensive.

DLLs for Write Launch

Typical DDR2 memory controllers can get away with one DLL for several data outputs. In DDR3, due to the fly-by topology, it will be more usual to see a DLL for every 8 bits or so. A larger number of DLLs would therefore need to be included in the DDR2 design to provide the resources required for the DDR3 migration. A digital DLL implementation can be very compact in die size and can minimize the overhead associated with the DDR3 requirement.

Read Leveling

Typical DDR2 memory controllers use an extra pair of I/O pins to calibrate the controller read timing. These pins are used to help adjust the incoming data with respect to the strobe. Other controllers use a training sequence by writing and reading data from memory and adjusting the strobe to optimize the data capture point. In DDR3, the Read Leveling feature is used to do this and requires no additional pins. If the memory controller can be designed to include the Read Leveling feature, even if not used in DDR2, it would help significantly with DDR3 migration.

Summary

DDR3 offers a substantial performance improvement over previous DDR2 memory systems. New DDR3 features, all transparently implemented in the memory controller, improve the signal integrity characteristics of DDR3 designs so that higher performance is achieved without an undue burden on the system designer. If proper consideration is given to any new DDR2 memory design, it can be a relatively easy upgrade to support DDR3 in the next generation design. This paper identified the key differences between DDR2 and DDR3 and illustrated some of the key issues that need to be addressed for easy migration to DDR3.

If you have further questions, please contact Virage Logic Customer Support.

On the web: www.viragelogic.com/contact

Via e-mail: support@viragelogic.com

Toll-free: (877) 360-6690

Or: Virage Logic Corporation
47100 Bayside Parkway
Fremont, CA 94538
510-360-8000

New Challenges For Hardware Engineers

Tuesday, September 16th, 2008

It used to be fun to be a chip architect. You could wake up in the morning, grab a cup of strong black coffee and run through a few power and performance tradeoff calculations before deciding on the high-level architecture. That would set the engineering direction for months, if not years. On a good day, after introducing a steady infusion of caffeine into your bloodstream, you felt like the all-powerful creator of an electronic universe.

That dream job began showing its first signs of vulnerability at the 130nm process node, especially as the SoC began emerging as the leading design platform. The job description weakened further at 90nm, and by 65nm it had devolved into something far less satisfying. The trend only gets worse from here. More people are entering the conceptual design phase of building a chip with each rev of Moore’s Law. Suddenly, there are people talking about power budgets and yield, and verification engineers trying to build in ways to solve their problems earlier. Managers are screaming for first-time silicon success. And software engineers (who, incidentally, no one has ever understood very well) are now sitting at the table at initial conception, slurping Diet Coke or Mountain Dew, and speaking a language no hardware engineer can understand.

Welcome to the brave new world of hardware engineering. It’s called system-level design, and it’s become so complex that just to get the job done now requires steady and concurrent input of multiple disciplines. Engineers are struggling to keep up with multiple power domains, multiple cores that exist only because classical scaling for performance died at 90nm, and timing issues that get complicated by shared busses, shared memory, and shared resources within engineering groups.

“The technologies for low-power design are well understood for silicon,” says Nikhil Jayaram, director of CPP engineering at Cisco Systems. “The challenge is in the complexity of those technologies. You have to ask yourself, can you pull it off in a reasonable design cycle?”

The answer is always yes, of course, but the cost is not always easy to swallow. Complexity is measured in terms of additional resources. Jayaram said that number is about 20% to 50% extra per design, depending upon the complexity of the design itself. Why? “You have to buy more tools and use more people.”

There are plenty of tools, too. In order to address this complexity, vendors have been introducing a steady stream of new tools that raise abstraction levels or combine multiple tasks. Those go hand in hand with new standards such as TLM 2.0. But the learning curve on these new tools and standards is quite steep, demanding time from engineers who are hard pressed already. Even the IP that is supposed to simplify chip design and development is so complicated that it often needs additional IP just to be able to ensure it can be debugged or manufactured properly.

One verification engineer at a very large, well-known chip maker (he asked for anonymity because he didn’t get approval from his bosses before talking to System-Level Design), said overload is becoming a serious issue among engineers.

“Designers are required to become experts in three completely different languages that the industry has standardized on as mainstream,” says the engineer. “The languages are SVA (System Verilog Assertion) for the assertion-based methodology, SV (System Verilog) for the testbench methodology, and C/C++ for system-level hardware/software verification. A verification engineer cannot get by without becoming an expert in these three languages. The way to deal with this is through the right schooling so that engineers come out with the expertise in all three. Standards have definitely helped with this. The frustration of course will be for the engineers that are on the job for many years and now need to become skillful in three different areas. As things are today, I am finding it very difficult to justify all three methodologies to my customers and they are missing out on quality because of this.”

That’s only part of the problem in verification. While five years ago engineers were complaining about getting too little data back from foundries such as TSMC, UMC and Chartered Semiconductor, they’re now complaining about being flooded with data. There are volumes of it, literally, and there’s no way other than just plain luck to pinpoint a bug without running tests on broad areas of that data. TLM 2.0 purportedly will help (see related story), but TLM 2.0 tools come with a fairly steep learning curve of their own. How do you construct a test model, for example, using object-oriented code?

There’s a reason why verification is still 70 percent of the NRE time budget and cost for developing new chips. Despite throwing lots of money, resources, and the best minds in the world at the problem, that number hasn’t budged much.

IP, Verification IP, and insurance IP

Nowhere is this overload more evident than in the IP world. Why write a piece of code for a standard interface or a piece of memory if someone with experts on the bleeding edge of technology has already done it? That way of thinking is growing. IP is a big market, and the problems of five years ago, when companies bought advanced IP only to face challenges (and potentially huge expense) getting it to work, remain enormous.

Buying IP isn’t like buying a pair of shoes. It’s more like setting up a deep partnership that lasts for the life of a chip’s many iterations. And getting those partnerships to work properly can be a time-consuming process. That explains why many of the smaller IP companies have evaporated even though a decade ago pundits said the low barrier to entry for IP startups would create a vast array of parts that could be simply plugged into a system on chip. Things didn’t work out so well in the real world.

“When you walk in to a partnership you need to get a complete match on the methodologies and tool sets,” said an engineer, who spoke on condition that he not be named. “This is soooo difficult. Very high level managers are finding themselves bleeding trying to make this work. Your tool set may be delivered by multiple vendors in addition to internal tools. Internal tools cause even more problems that are related to support, IP, etc.”

The engineer noted that standards will help solve this: everything from standard formats to standard languages and standard methodologies, which is what the new verification IP committee is trying to tackle.

Business, As Usual?

Beyond all of this, there is the incursion of the business groups. It was bad enough to build chips that worked. Now they have to be built on time, within a financial budget, and they have to include more complex technology and tricks than ever before.

One solution for keeping chips in budget is using the lowest-cost tools. The problem with that approach, say engineers, is that not all tools share exactly the same functionality. So what happens when you run simulators such as VCS (Verilog Compiler Simulator, formerly from Chronologic but now owned by Synopsys), IUS (Cadence Incisive Unified Simulator), and Mentor Graphics’ ModelSim? The answers to that question vary by project, and frequently within the same project.

But no matter how bad it looks, at each new process node there will be more cooks in the kitchen. You can fight it, ignore it, embrace it, but know that only the last choice is the right answer.

Ed Sperling

Cross-Talking with TLM 2.0

Tuesday, September 16th, 2008

By Ed Sperling

It’s almost like flying over the Great Plains of the United States. On the ground it’s hard to see above the corn stalks, but in an airplane you can see the entire horizon even if you can’t see those stalks anymore.

The analogy applies to where most of the major players in chip design say the engineering for systems on chips needs to go. With millions more gates available at each new process node, compounded by multiple power domains and incredibly complex timing issues, scrutinizing detail at the RTL or pin level is becoming less important than seeing the big picture and drilling down from there. The diagram is a lot easier to plan, follow and verify at a higher altitude, even if the details are a little blurry.

This top-down approach is the basis of the new Transaction-Level Modeling (TLM) 2.0 standard created by a working group of the Open SystemC Initiative. It’s still not perfect—in fact, some engineers say it’s a long way from that—but it’s a lot better than what was there before. And it opens the door for more concurrent design possibilities so that increasingly complex SoCs can be developed at least as quickly as previous generations of chips and pieces can be re-used much more easily.

What’s new?

The first attempt at raising the level of abstraction into what became known and overhyped as electronic system-level design, or ESL, was the TLM 1.0 standard, which was introduced in June 2005. However, TLM 1.0 didn’t allow engineers to bridge together various different tools, so verification engineers had no easy way of linking back to the design engineers or the chip architects. Some chip developers developed their own proprietary bridges between those areas, with inconsistent levels of success.

TLM 2.0 adds structure to this confusion, allowing users of tools that comply with the standards to build models that can be tested and verified across an entire system on a chip.

“The biggest challenge today is models,” said Glenn Perry, general manager of ESL/HDL Design Creation at Mentor Graphics. “If you get some from your IP vendors and some from your internal modeling group, there is no guarantee that these models work together. Historically there have been different protocols for the way these models interface and communicate with each other. TLM 2.0 provides a broad communication standard that makes it easy for these models to connect. The tools themselves—only recently a few EDA vendors have taken it into the mainstream. Design analysis, synthesis and verification.”

How chip developers define those models determines the speed and granularity for developing design components and generating test results. TLM is an abstraction of a design. It can be analyzed and simulated, and engineers can interact with it in various ways. For example, loosely timed abstractions give more general results more quickly, while more exact timing models generate more detailed results—although more slowly. These are the kinds of tradeoffs designers will need to consider in the future, as shown in the following diagram from the Open SystemC Initiative, which developed TLM 2.0.

Regardless of which approach is taken, however, all the tools have to work in the same sandbox. “There are two kinds of design flows,” said Jakob Engblom, technical marketing manager at Virtutech in Sweden. “One is from the system guys, who build a box, not an SoC, and then they make the hardware work. The second is from the software developers, who have to make the software work with the hardware. Clearly, there is more and more need for concurrent design, and it needs to be done on the system level, not the component level. An SoC is more than a hardware flow. There is only value at the hardware level if you can add a higher level of abstraction to run software, too.”

Engblom said that simulation at the pin level is no longer viable because it takes too long. There are simply too many parts to simulate—millions of gates, multiple power domains, complex memory and logic structures and shared busses. That job becomes even more complicated when you factor in multiple cores and the software that needs to be developed to take advantage of those cores, and in future iterations of chip development, stacked dies.

The tradeoff with TLM 2.0 is using a loosely-timed abstraction. At every level you gain performance and lose detail. But you can connect that to more detailed models where you need them. This is a standard, not a device library. You use it to build models and test at the level of abstraction you need. There’s still no getting around creating a system map and a lot of hard work, but the choice is simulating small with a great level of detail or simulating the big picture with a lower level of detail.
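The speed-versus-detail tradeoff can be made concrete with two toy models of the same memory. This is a plain Python illustration and not the TLM 2.0 or SystemC API: the class names, the `transport`, `start` and `tick` methods, and the fixed 10-cycle latency are all assumptions.

```python
class LooselyTimedMemory:
    """One function call per transaction; timing is a single annotated latency."""

    def __init__(self, latency_cycles=10):
        self.data = {}
        self.latency_cycles = latency_cycles   # assumed fixed latency
        self.calls = 0                         # simulation work done

    def transport(self, cmd, addr, value=None):
        self.calls += 1
        if cmd == "write":
            self.data[addr] = value
        return self.data.get(addr, 0), self.latency_cycles


class CycleSteppedMemory:
    """Same behavior, but the caller must tick the model every clock cycle."""

    def __init__(self, latency_cycles=10):
        self.data = {}
        self.latency_cycles = latency_cycles
        self.calls = 0
        self._busy = 0
        self._pending = None

    def start(self, cmd, addr, value=None):
        self._pending = (cmd, addr, value)
        self._busy = self.latency_cycles

    def tick(self):
        """Advance one cycle; return the result when the access completes, else None."""
        self.calls += 1
        if self._pending is None:
            return None
        self._busy -= 1
        if self._busy > 0:
            return None
        cmd, addr, value = self._pending
        self._pending = None
        if cmd == "write":
            self.data[addr] = value
        return self.data.get(addr, 0)


def run_both(n=100):
    lt, cs = LooselyTimedMemory(), CycleSteppedMemory()
    lt_time = 0
    for i in range(n):
        _, delay = lt.transport("write", i, i * 2)
        lt_time += delay                       # time advances in one jump
    cs_time = 0
    for i in range(n):
        cs.start("write", i, i * 2)
        result = None
        while result is None:                  # time advances cycle by cycle
            result = cs.tick()
            cs_time += 1
    return lt.calls, cs.calls, lt_time, cs_time, lt.data == cs.data
```

Both models end with the same memory contents and the same simulated time, but the cycle-stepped model needs ten times as many model evaluations. In exchange, it can show what happens on any individual cycle, which the loosely-timed model cannot.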

Debugging

That’s particularly useful in the verification world, which accounts for about 70 percent of the time it takes to develop a chip. While models have to be relatively accurate in the design phase, they have to be 100 percent accurate in the verification phase. There is little future for companies that develop chips that do not comply with the original design, or which fail unexpectedly.

What has been frustrating in this area, however, is that verification engineers are working with massive amounts of data and no effective way to pinpoint where bugs are. As a result, they have to pore through all the data to find the bugs. Functional verification proposes to move verification up a level of abstraction, as well, but the models still need to be integrated into the overall system. TLM 2.0 supports those kinds of models, which ultimately may reduce the time it takes to debug complex chips.

“What’s changed is that now you can build all the models in a way that’s useful for them,” said Mike Meredith, president of the Open SystemC Initiative, which created the TLM 2.0 standard as part of a working group involving all the major EDA vendors as well as companies such as STMicroelectronics, Broadcom, Texas Instruments, Infineon, and an array of ESL startups such as Virtutech and Meredith’s own Forte Design Systems. “The standard is agnostic about the processor and the busses.”

TLM also allows engineers to describe a test bench in transaction terms. That means data can be looked at functionally rather than trying to understand the individual bits in a transaction. But you can have a TLM-based test bench and still be relatively lost if you haven’t evolved the debug platform, said Perry. “One nice addition is the debug transaction interface. You can do really intelligent things with this.”

The future—more speed, more tweaks

But that intelligence may be limited to certain areas of chip design. Said one engineer, who asked not to be named: “My basic complaint regarding TLM 2.0 is the TLM working group focused too much on memory-mapped buses and the SOC or single ASIC as a focus. In my opinion they left the system out of ESL. In our systems the processor is important, but the processor and its surrounding registers and such are only 10% or less of a single ASIC, and frankly are not really a challenge in ASIC design.”

The engineer said it’s more important to model a host operating system such as Windows or Linux, running specific drivers and customer applications, talking to a virtual storage host bus adapter, and running actual firmware, which is in turn talking to a virtual storage area network with hundreds or even thousands of attached devices. “TLM 2.0 helps us with a tiny, tiny sliver of that rather large task,” he said. “We will use TLM 2.0 when picking up models from vendors where it makes sense, but we will not be rewriting any code to use TLM 2.0. Nor will new code development use TLM 2.0 directly. I think the rush to standardize TLM 2.0 is premature since it really has not yet proven itself. TLM 2.0 has some good features. Defining the possible modeling levels and defining how they interact is good. The idea of sockets to bundle ports together is good, though it can add a large coding burden for someone trying to implement a module to conform to an interface. So in a nutshell, TLM 2.0 is okay for a vendor writing a handful of modules connected to an AXI bus, but it is less suited to modeling large systems with many custom or specialized interfaces. Sure TLM 2.0 can be forced to work in these situations, but in the end it is no better than what I have now.”

Already, changes are afoot to rectify some of these problems. Sources involved in the TLM 2.0 discussions say the next steps are improving performance in such areas as direct memory access, which includes access time between the CPU and the memory. In C++, performance reportedly is almost double what it is in SystemC, the basis of TLM 2.0.

There also is room for improvement in the future with network-on-chip designs, where multiple bus architectures have to be bridged together. Some of that capability is included in TLM 2.0, but expect enhancements in future updates to the standard.

In addition, there are some tricks being used by companies that currently are not included in the standard. One is to build more optimized allocators, which can speed up performance by three to five times. The trick is learning the methodology in the standard and understanding it well enough to be able to use it more effectively. For many companies working with TLM 2.0, the standard is just a starting point. It’s also a bridge point between various disciplines that have never worked seamlessly together—people with different areas of expertise who now must work on projects concurrently instead of in series.
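The article does not describe what these optimized allocators look like. One familiar form is an object pool that recycles transaction objects instead of constructing a new one for every access; the sketch below shows that idea in Python, with `Transaction` and `TransactionPool` as hypothetical names. It says nothing about the speedups quoted above, which depend on the language and simulator.

```python
class Transaction:
    """Minimal transaction record; the fields are illustrative."""

    __slots__ = ("cmd", "addr", "data")

    def __init__(self):
        self.reset()

    def reset(self):
        self.cmd, self.addr, self.data = None, 0, None


class TransactionPool:
    """Hands out recycled Transaction objects instead of building new ones."""

    def __init__(self):
        self._free = []
        self.created = 0   # how many objects were actually constructed

    def acquire(self):
        if self._free:
            return self._free.pop()
        self.created += 1
        return Transaction()

    def release(self, txn):
        txn.reset()        # clear stale fields before the object is reused
        self._free.append(txn)


def demo_pool(n=1000):
    pool = TransactionPool()
    for addr in range(n):
        txn = pool.acquire()
        txn.cmd, txn.addr, txn.data = "write", addr, addr & 0xFF
        pool.release(txn)
    return pool.created
```

In `demo_pool()` a thousand sequential accesses reuse a single object, so allocation cost is paid once rather than once per transaction.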

“What you’re going to see is synergy in design teams as they begin talking to each other and as people learn new skills,” said OSCI’s Meredith. “As development becomes a critical part of design, you’re going to see entirely new job categories emerge.”