Designing the Right Architecture, Part I: SoC Interconnect and Memory Optimization with Synopsys Platform Architect
Abstract Part I
Designing the right architecture of a multi-processor SoC for today’s sophisticated electronic products is a challenging task. The most critical element for meeting the performance requirements of the entire system is the interconnect and memory architecture. These SoC infrastructure IP components are highly configurable and need to be customized to the communication needs of all the other modules on the chip, such as the application processor, the graphics unit, and all the external connectivity IP. Finding the right configuration of the interconnect and memory IP to balance performance requirements and cost considerations requires an efficient performance analysis methodology, which allows for early and accurate investigation of architectural trade-offs.
In the first part of this two-part article we present a tool-assisted system-level performance analysis flow for interconnect and memory performance optimization using Synopsys Platform Architect. This environment allows the rapid creation of system-level performance models and the parallel simulation of many design configurations to investigate a wide range of architectural options.
In the second part we will present the results of a design project, where Platform Architect has been used to optimize the performance of a multicore mobile platform SoC.
Why Do We Care?
Incorporating more and more functions and features into electronic products directly translates into increasing SoC design complexity. Devices integrate a multitude of heterogeneous programmable cores to achieve the necessary flexibility and power efficiency. The diverse communication requirements of all these cores lead to complex interconnect and memory infrastructure to provide the required storage and communication bandwidth.
For this purpose the SoC interconnect and memory sub-system feature complex mechanisms like distributed memory, cascaded arbitration, and Quality of Service (QoS). As a result, dimensioning the interconnect and memory infrastructure poses a variety of formidable design challenges:
- Large Design Space
Due to the complexity and configurability of the SoC infrastructure IP (interconnect, memory), tailoring the SoC infrastructure to the specific needs of the product requirements is a non-trivial task.
- SoC Infrastructure Specialization
Whereas many IP blocks can be reused off-the-shelf, the SoC infrastructure needs to be customized to serve the specific communication requirements of all IP blocks. Even for derivative designs, where only one or two IP blocks change, the SoC infrastructure needs to be adjusted to meet the new overall performance requirements.
- Dynamic Workload
Different applications running at different points in time are sharing a limited set of available resources. Hence, the workload on the SoC infrastructure is difficult to estimate due to the multitude of different product use-cases.
- High Price of Failure
An inadequately dimensioned SoC infrastructure leads to insufficient product performance, which in turn can cause a missed market opportunity.
- High Potential for Optimization
Typically the SoC infrastructure either wastes area and power due to over-design or fails to deliver the specified performance. Hence there is high potential for optimization to get it just right. An optimized interconnect and memory architecture can significantly lower area and hence the SoC fabrication cost. In some cases over-design may be intentional, to allow new functionality to be added in a later version of the silicon and extend the Time-In-Market of the architecture. However, in this case the headroom in the capacity of the SoC infrastructure needs to be well quantified.
These challenges constitute the motivation for an efficient performance analysis methodology, which allows the early quantitative analysis of the system performance in order to optimize the interconnect and memory architecture. This concerns both device manufacturers, who want to influence their semiconductor providers such that all their application use-cases are well supported, as well as chip manufacturers, who want to address a variety of customer use-cases.
How Early Is Early?
The dimensioning of the SoC infrastructure is one of the first steps in a design project. As depicted in Figure 1, the input comes from the marketing requirements in terms of required features, supported features, performance numbers, and product cost. At this stage the high-level SoC block diagram is more or less defined. This refers primarily to the set of IP blocks used as well as the high-level connectivity. It does not include architectural details like the exact topology.
|Figure 1: Input and Output of an Interconnect and Memory Performance Optimization Flow|
The outcome of a performance analysis project is the optimized configuration of the interconnect- and memory-architecture, which feeds into the RTL-to-GDSII flow. Typically an architecture evaluation report also summarizes the findings and provides recommendations for the final architecture.
Traditional Methods for Architecture Analysis
Architecture definition has always been a necessary step in any SoC design project. Traditionally, performance has been analyzed using spreadsheets or detailed hardware simulation. However, design complexity has reached a level where these methods are no longer adequate. On the one hand, static spreadsheet analysis does not account for the dynamic behavior of multiple software applications and the multiple levels of scheduling and arbitration in the executing hardware platform. This bears a great risk of mispredicting the actual performance, which can lead to under- or over-design of the system architecture. On the other hand, hardware simulations become available late in the design cycle, run very slowly, and do not provide system-level performance analysis results. Hence, they are also not a suitable approach for early architecture analysis and optimization.
Performance Analysis Flow Overview
The recommended flow for early quantitative SoC performance analysis is illustrated in Figure 2. At first we create a performance model of the SoC platform using a combination of cycle accurate transaction-level models for the interconnect and memory architecture as well as workload models for the relevant bus masters. Now a multitude of simulations are executed to investigate the impact of architectural choices, design parameter configurations and workload situations on the given set of performance and cost metrics.
|Figure 2: Iterative Architecture Optimization Flow|
The top-level performance metrics for each design configuration in such a parameter sweep are automatically aggregated into a single spreadsheet. This allows for a very efficient sensitivity analysis of how design parameters influence certain performance metrics. At the same time, each simulation generates a detailed set of analysis results, which enables in-depth root-cause analysis of performance issues.
The outcome from the investigation of both the high level and the low level analysis results should be a concrete strategy on how to further improve the architecture. This is the start of the next round of optimizations, until finally the results meet the given performance and cost constraints.
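The sweep-and-aggregate loop described above can be sketched as follows. This is a minimal illustration, not Platform Architect's actual API: `run_simulation` is a hypothetical stand-in for one simulation run, and its toy analytic model merely supplies plausible numbers for the example.

```python
import csv
import itertools

# Hypothetical stand-in for one Platform Architect simulation run: returns
# top-level performance metrics for a given design configuration.
def run_simulation(config):
    # Toy analytic model: a wider bus and a faster clock raise throughput
    # and lower latency (invented formula, for illustration only).
    beats_per_us = config["bus_width"] * config["clock_mhz"] / 3200.0
    return {
        "avg_latency_ns": round(1000.0 / beats_per_us, 1),
        "throughput_mbps": config["bus_width"] * config["clock_mhz"] / 8.0,
    }

# Design parameters to sweep.
sweep = {
    "bus_width": [32, 64, 128],   # data bus width in bits
    "clock_mhz": [200, 400],      # interconnect clock frequency
}

# Simulate every combination and aggregate all metrics into one table,
# which can then be opened as a spreadsheet for sensitivity analysis.
rows = []
for values in itertools.product(*sweep.values()):
    config = dict(zip(sweep.keys(), values))
    rows.append({**config, **run_simulation(config)})

with open("sweep_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```

The resulting table has one row per design configuration, which is exactly the shape needed to see how each parameter moves each metric.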
The Requirements for Efficient Performance Analysis
A simulation-based performance analysis methodology can only be deployed in production SoC design projects if the results are reliable enough to drive important design decisions and if they can be obtained with reasonable effort in a short amount of time. This high-level observation can be broken down into the following requirements:
- Sufficient Accuracy
Performance analysis is used to check whether the performance requirements are met and to optimize the SoC infrastructure. The analysis methodology needs to be sufficiently accurate to give the system architect the confidence to take design decisions.
- Product Use-Case Analysis
The SoC infrastructure needs to deliver sufficient communication bandwidth under all relevant product use-cases. Therefore the system architect must be able to efficiently analyze the impact of the workload, which is imposed by different use-cases onto the interconnect and memory architecture.
- Quick Setup Time
Typically the time window to define or influence the system architecture is very short, i.e. on the order of a few months at most. Hence, a performance analysis methodology needs to be set up and deliver results within a few weeks.
- Quick Turn-Around Time
Architecture optimization entails the investigation and comparison of different design configurations. Since the number of design parameters and product use-cases is typically very large it is mandatory that architectural alternatives can be evaluated on a daily, if not hourly, basis.
- Comprehensive Performance Metrics
It must be possible to obtain all relevant performance metrics (latency, throughput, utilization, efficiency). These metrics enable the system architect to evaluate architectural trade-offs and to arrive at an optimal architecture.
In the next sections we talk about how to obtain the two major ingredients for an efficient performance analysis flow: a performance model of the SoC platform and a workload model of the application use-case.
Creating a Performance Model of the SoC platform
As illustrated by the case study in Figure 5, the key idea is to focus on the relevant components:
- Omit all components that do not consume any significant bandwidth of the SoC infrastructure.
- Replace all significant initiator IP blocks with workload models that generate the corresponding bus traffic.
- Deploy accurate SystemC transaction-level models for the SoC interconnect and memories, as those components are the focus of the performance investigation.
- Use simulation and analysis to measure the relevant performance metrics.
Because the methodology only requires a few components, we can quickly assemble a flexible performance model, which can be used to efficiently investigate the performance of the SoC with the necessary accuracy.
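To make the last step, measuring the relevant performance metrics, concrete: the following sketch derives latency and throughput figures from a list of completed transactions, as a performance monitor attached to the model might record them. The log entries and the window length are invented for illustration.

```python
# Each record: (start_ns, end_ns, bytes) for one completed bus transaction,
# as a hypothetical performance monitor in the simulation might log them.
transactions = [
    (0, 40, 64),
    (10, 70, 64),
    (80, 110, 32),
]

def summarize(log, window_ns):
    """Reduce a transaction log to the top-level metrics the architect
    compares across design configurations."""
    latencies = [end - start for start, end, _ in log]
    total_bytes = sum(b for _, _, b in log)
    return {
        "avg_latency_ns": sum(latencies) / len(latencies),
        "max_latency_ns": max(latencies),
        # throughput over the observation window, in bytes per ns
        "throughput": total_bytes / window_ns,
    }

metrics = summarize(transactions, window_ns=110)
```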
Creating Workload Models of the Application Use-Case
Workload models replace the actual models of the initiator IP sub-systems. The goal is for the workload models to generate bus traffic equivalent to that of the actual IP blocks. The traffic does not need to be identical on a cycle-by-cycle basis, but the performance profile of the generated bus traffic should match, i.e. the measured performance metrics such as latency, throughput, burstiness, and utilization are equivalent.
|Figure 3: Task-driven (left) and Trace-driven (middle) Workload modeling as opposed to executing real SW (right)|
Workload models are the most productive way to stimulate the interconnect and memory architecture with realistic traffic while taking all use-cases of the SoC into account. The alternative to using workload models would be fully functional and cycle-accurate models of the IP sub-systems (see right side of Figure 3):
- The programmable IP sub-systems require cycle accurate representation of the CPU, capable of running the actual SW. Depending on availability, this can be a cycle-accurate SystemC TLM instruction set simulator (CA ISS) or the CPU IP RTL running in co-simulation or co-emulation with the SystemC TLM platform.
- The non-programmable IP blocks need to be modeled in terms of cycle accurate SystemC TLM models.
Creating such a fully-functional and cycle-accurate platform model requires a lot of initial modeling effort. Even if all models are available, it can be cumbersome to configure the SW such that all the relevant traffic scenarios are covered. Exploring architectural changes can be difficult, because of the effort to change the detailed platform model.
For the reasons outlined above, the initial performance analysis should be carried out using workload models. As the platform specification matures, the workload models can be incrementally replaced by the cycle-accurate functional models or the CPU IP RTL. This gradually converts the performance exploration model on the left side of Figure 3 into a cycle accurate virtual prototype as depicted on the right. This way the initial assumptions in the workload model can be validated and the architecture can be fine-tuned.
In general there are different methods to create a workload model:
- Trace-driven workload models generate bus traffic based on a transaction file.
- Task-driven workload models generate bus traffic based on a task-graph performance model of the application.
In the following we focus on Trace-driven workload models.
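Conceptually, a trace-driven workload model is a replay loop over a transaction file. The sketch below illustrates the principle with an invented tuple format (inter-transaction delay, read/write, address, burst length); it is not the actual STL syntax, and `issue` stands in for handing a transaction to the initiator socket.

```python
from collections import namedtuple

# One trace entry: cycles to wait before issuing, read/write direction,
# start address, and burst length in beats. Field names are illustrative.
Txn = namedtuple("Txn", "delay rw addr burst_len")

trace = [
    Txn(delay=0,  rw="W", addr=0x1000, burst_len=8),
    Txn(delay=4,  rw="R", addr=0x2000, burst_len=4),
    Txn(delay=10, rw="R", addr=0x2040, burst_len=4),
]

def replay(trace, issue):
    """Drive a bus interface from a trace: wait out the inter-transaction
    delay, then hand the transaction to the initiator-side callback."""
    cycle = 0
    for txn in trace:
        cycle += txn.delay          # elastic gap before the next request
        issue(cycle, txn)

issued = []
replay(trace, lambda cycle, txn: issued.append((cycle, txn.rw, txn.addr)))
```

Note that the delays are relative gaps, not absolute timestamps; this is what lets the trace stretch or compress as the interconnect gets slower or faster.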
Creating Elastic Trace-Driven Workload Models
Synopsys Platform Architect promotes the use of the Socket Transaction Language (STL) for the definition of trace-driven workload models. This flexible trace format is executed by the Generic File Reader Bus Master (GFRBM), whose TLM-2.0 Approximately Timed initiator socket can be connected to arbitrary interconnect fabrics. Apart from the general advantages of workload models mentioned in the previous section, traffic generation based on the GFRBM and STL provides the following advantages:
- Creating STL files is simple and requires relatively little effort.
- If available, reference traces recorded from RTL simulation, an evaluation board, or a cycle-accurate ISS can easily be translated into STL files.
- Unlike stochastic traffic generators, STL allows specifying each transaction individually. This makes it easy to set up specific scenarios that exercise a particular function or feature of the interconnect and memory subsystem.
- "Accuracy" of the workload model is achieved by the GFRBM interface in combination with the respective transactor covers. For example in case of AXI all relevant protocol features like e.g. burst types, pipelining, out-of-order transactions and protocol specific in-band attributes are supported.
- "Configurability" of the workload model is achieved by generating different STL files for different IP configurations and operating modes.
- "Elasticity" and "adaptability" of the workload model is achieved by specific features of the GFRBM and STL related to multi-threading and waiting, which enables the specification of complex and dynamic scenarios.
The synchronization of multiple traffic streams on the same initiator or between different initiators is especially important, so that the workload model responds to changes in the architecture in a similar way to the real initiators.
|Figure 4: Elastic Transaction Trace|
In the scenario shown in Figure 4, the traffic file mimicking CPU data triggers a DMA transfer and then waits for its completion before continuing. After improving the performance of the interconnect or memory, e.g. by increasing the clock frequency, the same workload model executes differently: the CPU traffic completes in a shorter time, so the DMA transfer can start earlier. These kinds of dependencies between traffic streams need to be captured as part of the workload model; otherwise it is not useful for architecture analysis, which is all about investigating the impact of changes in the architecture configuration on the performance of the system.
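The elasticity of this scenario can be illustrated with a toy timing model (all cycle counts are invented): the DMA trigger point is derived from the completion of the CPU stream rather than pinned to an absolute time, so speeding up the interconnect shifts the whole schedule forward.

```python
def run_scenario(cpu_cycles_per_txn, n_cpu_txns, dma_cycles):
    """Toy model of the Figure 4 scenario: the CPU stream issues n
    transactions, triggers the DMA, then waits for its completion.
    Per-transaction cost shrinks when the interconnect gets faster,
    so the trigger point and total runtime shift automatically."""
    cpu_done = n_cpu_txns * cpu_cycles_per_txn   # CPU phase finishes here
    dma_start = cpu_done                          # trigger fires at that time
    dma_done = dma_start + dma_cycles
    return {"dma_start": dma_start, "total": dma_done}

slow = run_scenario(cpu_cycles_per_txn=10, n_cpu_txns=8, dma_cycles=50)
fast = run_scenario(cpu_cycles_per_txn=5,  n_cpu_txns=8, dma_cycles=50)
# With the faster interconnect the same workload model starts the DMA
# earlier and finishes sooner: the trace is elastic, not fixed in time.
```

A fixed-timestamp trace would keep the DMA start at the original cycle regardless of the architecture change, which is exactly the behavior the elastic features of GFRBM and STL avoid.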
Automating the Creation of Trace Files
Although the STL file format is rather simple, it is also very verbose. Except for very short transaction sequences, writing STL files by hand is not recommended. Platform Architect provides utilities to generate STL files from an input description. The following STL generation utilities are provided:
- A Trace Conversion Utility converts available AMBA VCD trace files into STL.
- A Stochastic Trace Generation Utility generates a pseudo-random transaction sequence based on a few stochastic parameters. This is the preferred approach for programmable IP blocks, which exhibit a complex and unpredictable traffic pattern.
- A Deterministic Trace Generation Utility generates a well-defined transaction sequence based on a few IP specific configuration parameters. This is the preferred approach for non-programmable IP blocks, which exhibit a very predictable traffic pattern.
- Any transaction sequence from a SystemC transaction level model can be converted into STL so that slow simulation models can be replaced with trace-driven workload models.
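As a sketch of what a stochastic trace generation utility does, the following generates a pseudo-random transaction sequence from a few distribution parameters (read ratio, mean inter-arrival gap, address locality). The parameter names and the simple locality model are illustrative assumptions, not the actual utility's options.

```python
import random

def generate_stochastic_trace(n, seed, read_ratio=0.7,
                              mean_gap=8, locality=0.9, page_bytes=4096):
    """Generate n (gap, rw, addr) tuples from stochastic parameters.
    The 'locality' knob keeps most accesses sequential within a page,
    since the generated address sequence strongly influences DRAM and
    cache behavior (illustrative model)."""
    rng = random.Random(seed)           # fixed seed -> reproducible trace
    addr = 0x8000_0000
    trace = []
    for _ in range(n):
        rw = "R" if rng.random() < read_ratio else "W"
        gap = rng.randrange(0, 2 * mean_gap + 1)   # idle cycles before issue
        if rng.random() < locality:
            addr += 64                   # sequential access within the page
        else:
            addr = 0x8000_0000 + rng.randrange(0, 256) * page_bytes
        trace.append((gap, rw, addr))
    return trace

trace = generate_stochastic_trace(n=1000, seed=42)
```

The fixed seed makes each generated trace reproducible, so a design-space sweep compares architectures under identical stimulus.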
When choosing the right utility, it is important to keep in mind the relevance of the generated address sequence. DRAM performance depends strongly on the address sequence, since it determines whether a precharge/activate delay occurs. Cache performance is likewise highly dependent on the address sequence, since it determines whether a cache hit or miss occurs; this in turn has a high impact on the request latency and the traffic load on the SoC infrastructure. If a DRAM is part of the performance model, even the stochastic STL generation utility should strive for a certain accuracy of the address sequence. If a functional cache model is used, stochastic STL generation is not an option, because minor inaccuracies in the generated address sequence distort the analysis results.
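The DRAM sensitivity to the address sequence can be quantified with a simple row-hit estimate. The sketch below uses a simplified open-page model with one open row per bank; the row size, bank count, and address-to-bank mapping are illustrative assumptions, not a specific DRAM controller's policy.

```python
def row_hit_rate(addresses, row_bytes=2048, n_banks=8):
    """Estimate how often an access finds its DRAM row already open.
    A row miss forces a precharge/activate cycle, so this ratio is a
    quick proxy for how DRAM-friendly an address sequence is."""
    open_row = {}                       # bank -> currently open row
    hits = 0
    for addr in addresses:
        row = addr // row_bytes
        bank = row % n_banks            # simple bank-interleaving assumption
        if open_row.get(bank) == row:
            hits += 1
        open_row[bank] = row
    return hits / len(addresses)

# A sequential stream mostly hits the open row; a scattered stream does not.
sequential = [0x1000 + 64 * i for i in range(64)]
scattered = [0x1000 + 4096 * i for i in range(64)]
# row_hit_rate(sequential) -> 0.96875, row_hit_rate(scattered) -> 0.0
```

Comparing this metric for a generated trace against a reference trace is one way to check whether a stochastic generator's address sequence is accurate enough for a model that contains a DRAM.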
Summary Part I
Today’s complex designs rely on an optimal architecture configuration to achieve the right balance between latency, throughput, cost, and power. In the first part of this article we presented a tool-assisted performance analysis flow for AMBA-based SoC designs. In the upcoming second part we will illustrate this flow based on the performance analysis of a typical multicore mobile SoC platform.
Tim Kogel received his diploma and PhD degree in electrical engineering with honors from Aachen University of Technology (RWTH), Aachen, Germany, in 1999 and 2005 respectively. He has authored a book and numerous technical and scientific publications on electronic system-level design of multi-processor system-on-chip platforms. Today, he is working as a Solution Architect at Synopsys Inc. In this position, he is responsible for the product definition and future direction of Synopsys' SystemC-based Platform Architect product line.