Published on January 10th, 2013
Abstract Part I
Designing the right architecture of a multi-processor SoC for today’s sophisticated electronic products is a challenging task. The most critical element for meeting the performance requirements of the entire system is the interconnect and memory architecture. These SoC infrastructure IP components are highly configurable and need to be customized to the communication needs of all the other modules on the chip, such as the application processor, the graphics unit, and all the external connectivity IP. Finding the right configuration of the interconnect and memory IP to balance performance requirements and cost considerations requires an efficient performance analysis methodology, which allows for early and accurate investigation of architectural trade-offs.
In the first part of this two-part article we present a tool-assisted system-level performance analysis flow for interconnect and memory performance optimization using Synopsys Platform Architect. This environment allows the rapid creation of system-level performance models and the parallel simulation of many design configurations to investigate a wide range of architectural options.
In the second part we will present the results of a design project in which Platform Architect has been used to optimize the performance of a multicore mobile platform SoC.
Why Do We Care?
Incorporating more and more functions and features into electronic products directly translates into increasing SoC design complexity. Devices integrate a multitude of heterogeneous programmable cores to achieve the necessary flexibility and power efficiency. The diverse communication requirements of all these cores lead to complex interconnect and memory infrastructure to provide the required storage and communication bandwidth.
For this purpose the SoC interconnect and memory sub-system features complex mechanisms such as distributed memory, cascaded arbitration, and Quality of Service (QoS). As a result, dimensioning the interconnect and memory infrastructure poses a variety of formidable design challenges.
These challenges motivate an efficient performance analysis methodology that allows early quantitative analysis of system performance in order to optimize the interconnect and memory architecture. This concerns both device manufacturers, who want to influence their semiconductor providers such that all their application use-cases are well supported, and chip manufacturers, who want to address a variety of customer use-cases.
How Early Is Early?
The dimensioning of the SoC infrastructure is one of the first steps in a design project. As depicted in Figure 1, the input comes from the marketing requirements in terms of required features, supported features, performance numbers, and product cost. At this stage the high-level SoC block diagram is more or less defined. This refers primarily to the set of IP blocks used as well as the high-level connectivity. It does not include architectural details like the exact topology.
|Figure 1: Input and Output of an Interconnect and Memory Performance Optimization Flow|
The outcome of a performance analysis project is the optimized configuration of the interconnect and memory architecture, which feeds into the RTL-to-GDSII flow. Typically, an architecture evaluation report also summarizes the findings and provides recommendations for the final architecture.
Traditional Methods for Architecture Analysis
Architecture definition has always been a necessary step in any SoC design project. Traditionally, performance has been analyzed using spreadsheets or detailed hardware simulation. However, design complexity has reached a level where these methods are no longer adequate. On the one hand, static spreadsheet analysis does not take into account the dynamic behavior of multiple software applications and the multiple levels of scheduling and arbitration in the executing hardware platform. This bears a great risk of mispredicting the actual performance, which can lead to under- or over-design of the system architecture. On the other hand, hardware simulations become available late in the design cycle, run very slowly, and do not provide system-level performance analysis results. Hence, they are also not a suitable approach for early architecture analysis and optimization.
Performance Analysis Flow Overview
The recommended flow for early quantitative SoC performance analysis is illustrated in Figure 2. First, we create a performance model of the SoC platform using a combination of cycle-accurate transaction-level models for the interconnect and memory architecture and workload models for the relevant bus masters. Then a multitude of simulations is executed to investigate the impact of architectural choices, design parameter configurations, and workload situations on the given set of performance and cost metrics.
|Figure 2: Iterative Architecture Optimization Flow|
The top-level performance metrics for each design configuration in such a parameter sweep are automatically aggregated into a single spreadsheet. This allows for a very efficient sensitivity analysis of how design parameters influence certain performance metrics. At the same time, each simulation generates a detailed set of analysis results, which enables in-depth root-cause analysis of performance issues.
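To make the idea of such a parameter sweep concrete, the following Python sketch shows how per-configuration metrics can be aggregated into a single table. The `simulate()` function is an entirely hypothetical stand-in for one cycle-accurate simulation run (in the real flow, each run is a Platform Architect simulation of the SoC model); its analytic bandwidth formula and the parameter values are illustrative assumptions only.

```python
import itertools

# Hypothetical stand-in for one simulation run of a given design configuration.
# A real run would launch a cycle-accurate simulation; here we use a toy
# analytic model (peak bandwidth minus a fixed 15% arbitration overhead).
def simulate(bus_width_bits, dram_clock_mhz):
    peak_gbps = bus_width_bits * dram_clock_mhz * 1e6 / 8 / 1e9
    return {"bus_width": bus_width_bits,
            "dram_clock": dram_clock_mhz,
            "throughput_gbps": round(0.85 * peak_gbps, 3)}

# Sweep the design-parameter space: one aggregated row per configuration,
# ready to be dumped into a spreadsheet for sensitivity analysis.
sweep = [simulate(w, f)
         for w, f in itertools.product([64, 128], [400, 533])]

for row in sweep:
    print(row)
```

Sorting or pivoting such a table by one parameter at a time is exactly the kind of sensitivity analysis described above: it reveals which knob actually moves which metric.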
The outcome of investigating both the high-level and the low-level analysis results should be a concrete strategy for further improving the architecture. This starts the next round of optimizations, which continues until the results meet the given performance and cost constraints.
The Requirements for Efficient Performance Analysis
A simulation-based performance analysis methodology can only be deployed in production SoC design projects if the results are reliable enough to drive important design decisions and if they can be obtained with reasonable effort in a short amount of time. This high-level observation can be broken down into a set of concrete requirements.
In the next sections we describe how to obtain the two major ingredients of an efficient performance analysis flow: a performance model of the SoC platform and a workload model of the application use-case.
Creating a Performance Model of the SoC platform
As illustrated as part of the case study in Figure 5, the key idea is to focus the modeling effort on the components that are relevant for performance.
Because the methodology only requires a few components, we can quickly assemble a flexible performance model, which can be used to efficiently investigate the performance of the SoC with the necessary accuracy.
Creating Workload Models of the Application Use-Case
Workload models replace the actual models of the initiator IP sub-systems. The goal is for the workload models to generate bus traffic equivalent to that of the actual IP blocks. The traffic does not need to be identical on a cycle-by-cycle basis, but the performance profile of the generated bus traffic should be identical, i.e. the measured performance metrics such as latency, throughput, burstiness, and utilization are the same.
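As a toy illustration of this notion of profile equivalence, the Python sketch below compares two transaction logs with entirely made-up timestamps: they differ cycle by cycle, yet yield nearly identical latency and throughput metrics, which is the criterion that matters for a workload model.

```python
# Hypothetical transaction logs: (issue_time_ns, completion_time_ns, bytes).
# "real_ip" mimics traffic observed from an actual IP block, "workload" the
# traffic produced by its workload model; the numbers are invented.
real_ip  = [(0, 40, 64), (50, 95, 64), (100, 150, 64)]
workload = [(0, 42, 64), (48, 96, 64), (101, 149, 64)]

def profile(log):
    latencies = [done - issued for issued, done, _ in log]
    span_ns = max(done for _, done, _ in log) - min(t for t, _, _ in log)
    total_bytes = sum(b for _, _, b in log)
    return {"avg_latency_ns": sum(latencies) / len(latencies),
            "throughput_gbps": total_bytes / span_ns}  # bytes/ns == GB/s

print(profile(real_ip))
print(profile(workload))
```

The individual transactions never line up, but the aggregate profiles do: this is the equivalence a workload model must preserve.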
|Figure 3: Task-driven (left) and Trace-driven (middle) Workload modeling as opposed to executing real SW (right)|
Workload models are the most productive way to stimulate the interconnect and memory architecture with realistic traffic while taking all use-cases of the SoC into account. The alternative to using workload models would be to use fully functional and cycle-accurate models of the IP sub-systems (see the right side of Figure 3).
Creating such a fully functional and cycle-accurate platform model requires a lot of initial modeling effort. Even if all models are available, it can be cumbersome to configure the software such that all relevant traffic scenarios are covered. Exploring architectural changes can also be difficult, because of the effort required to change the detailed platform model.
For the reasons outlined above, the initial performance analysis should be carried out using workload models. As the platform specification matures, the workload models can be incrementally replaced by the cycle-accurate functional models or the CPU IP RTL. This gradually converts the performance exploration model on the left side of Figure 3 into a cycle accurate virtual prototype as depicted on the right. This way the initial assumptions in the workload model can be validated and the architecture can be fine-tuned.
In general there are different methods to create a workload model, such as task-driven and trace-driven modeling (see Figure 3). In the following we focus on trace-driven workload models.
Creating Elastic Trace-Driven Workload Models
Synopsys Platform Architect promotes the usage of the Socket Transaction Language (STL) for the definition of trace-driven workload models. This flexible trace format is executed by the “Generic File Reader Bus Master” (GFRBM), whose TLM-2.0 approximately-timed initiator socket can be connected to arbitrary interconnect fabrics. Apart from the general advantages of using workload models discussed in the previous section, traffic generation based on the GFRBM and STL provides a number of additional advantages.
The synchronization of multiple traffic streams on the same initiator or between different initiators is especially important so that the workload model responds to changes in the architecture in a similar way as the real initiators.
|Figure 4: Elastic Transaction Trace|
In the scenario shown in Figure 4, the traffic file mimicking CPU data triggers a DMA transfer and then waits for its completion before it continues. After improving the performance of the interconnect or memory, e.g. by increasing the clock frequency, the same workload model will execute differently: the CPU traffic completes in a shorter time, so the DMA transfer can start earlier. These kinds of dependencies between traffic streams need to be captured as part of the workload model; otherwise it is not useful for architecture analysis, which is all about investigating how changes in the architecture configuration affect system performance.
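The elasticity described above can be sketched in a few lines of Python. The cycle counts are invented for illustration; the point is that the DMA start time is derived from a synchronization point in the CPU stream rather than from a fixed timestamp, so speeding up the CPU traffic automatically pulls the DMA transfer forward.

```python
# Minimal sketch of an "elastic" trace dependency: the DMA stream waits on a
# sync event emitted by the CPU stream instead of starting at a fixed time.
def replay(cpu_cycles_per_txn, num_cpu_txns=10):
    cpu_done = cpu_cycles_per_txn * num_cpu_txns  # CPU stream finishes here
    dma_start = cpu_done                          # dependency, not a timestamp
    dma_done = dma_start + 100                    # fixed-length DMA transfer
    return dma_start, dma_done

# Faster interconnect -> shorter CPU transactions -> DMA starts earlier,
# without touching the trace itself.
print(replay(cpu_cycles_per_txn=8))   # baseline clock
print(replay(cpu_cycles_per_txn=4))   # after doubling the clock
```

A purely timestamp-driven trace would replay the DMA at the same absolute time in both runs and thus misrepresent the improved architecture.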
Automating the Creation of Trace Files
Although the STL file format is rather simple, it is also very verbose. Except for very short transaction sequences, it is typically not recommended to write STL files manually. Instead, Platform Architect provides several utilities to generate STL files from higher-level input descriptions.
When choosing the right utility, it is important to keep in mind the relevance of the generated address sequence. The DRAM performance is highly dependent on the address sequence, since the address sequence determines whether a precharge/activate delay occurs. The cache performance is likewise highly dependent on the address sequence, since it determines whether a cache hit or miss occurs; this in turn has a high impact on the request latency and the traffic load on the SoC infrastructure. If a DRAM is part of the performance model, even the stochastic STL generation utility should make an effort to achieve a certain accuracy of the address sequence. If a functional cache model is used, stochastic STL generation is not an option, because even minor inaccuracies in the generated address sequence distort the analysis results.
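Why the address sequence matters so much for DRAM can be illustrated with a toy open-row model in Python. The 4 KiB row size and the two access patterns are assumptions for illustration only, not the behavior of any particular memory controller.

```python
ROW_BITS = 12          # assume 4 KiB DRAM rows -> address >> 12 selects the row

def row_hits(addresses):
    """Count accesses that land in the currently open row (toy model)."""
    open_row, hits = None, 0
    for addr in addresses:
        row = addr >> ROW_BITS
        if row == open_row:
            hits += 1          # row already open: no precharge/activate delay
        else:
            open_row = row     # row miss: pay precharge + activate latency
    return hits

sequential = list(range(0, 64 * 64, 64))      # 64 back-to-back 64-byte bursts
strided    = list(range(0, 64 * 4096, 4096))  # 64 accesses, one per row
print(row_hits(sequential), row_hits(strided))
```

The same number of accesses yields 63 row hits in one case and none in the other: a stochastic trace generator that scrambles the address sequence would therefore predict very different DRAM latencies than the real traffic would see.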
Summary Part I
Today’s complex designs rely on optimal architecture configuration to achieve the right balance between latency, throughput, cost, and power. In the first part of this article we presented a tool-assisted performance analysis flow for AMBA-based SoC designs. In the upcoming second part we will illustrate this flow based on the performance analysis of a typical multicore mobile SoC platform.
Tim Kogel received his diploma and PhD degree in electrical engineering with honors from Aachen University of Technology (RWTH), Aachen, Germany, in 1999 and 2005 respectively. He has authored a book and numerous technical and scientific publications on electronic system-level design of multi-processor system-on-chip platforms. Today, he is working as a Solution Architect at Synopsys Inc. In this position, he is responsible for the product definition and future direction of Synopsys' SystemC-based Platform Architect product line.