Published in Spring 2013 issue of Chip Design Magazine
Designing the Right Architecture Part II: A Mobile Platform Case-Study with ARM CoreLink™ NIC 301 Interconnect
Abstract Part II
Designing the right architecture of a multi-processor SoC for today’s sophisticated electronic products is a challenging task. The most critical element for meeting the performance requirements of the entire system is the interconnect and memory architecture. These SoC infrastructure IP components are highly configurable and need to be customized to the communication needs of all the other modules on the chip, such as the application processor, the graphics unit, and all the external connectivity IP. Finding the right configuration of the interconnect and memory IP to balance performance requirements and cost considerations requires an efficient performance analysis methodology, which allows for early and accurate investigation of architectural trade-offs.
In the first part of this article we presented a tool-assisted system-level performance analysis flow for interconnect and memory performance optimization using Synopsys Platform Architect. This environment allows the rapid creation of system-level performance models and the parallel simulation of many design configurations to investigate a wide range of architectural options.
In this second part we present the results of a design project, where Platform Architect has been used to optimize the performance of a multicore mobile platform SoC.
Mobile Platform with ARM CoreLink™ NIC 301 Interconnect Case-Study
The architecture analysis flow is demonstrated by means of a typical multicore mobile SoC platform design. The product application use-case is graphics processing of data received via a high-speed Internet link while the CPU is handling general OS tasks.
A high-level block diagram of the SoC platform is depicted in Figure 1. The design comprises two ARM Cortex A9 CPU clusters, a graphics processor, several additional initiators, two DDR2 memory controllers and several on-chip SRAM memories. The interconnect is based on ARM CoreLink™ NIC-301. The performance requirements driving the selection of the optimal architecture configuration are listed on the right side of Figure 1.
|Figure 1: SystemC Transaction Level Performance Model of Mobile Platform SoC (left) and Requirements (right)|
Creating a complete cycle accurate virtual prototype of such a complex SoC platform would be a significant effort and the resulting simulation speed would be very slow. The key idea to obtain a flexible performance model in a short timeframe is to focus on the relevant components:
- The interconnect fabric is based on SBL-301, the cycle accurate Platform Architect Bus Library of the ARM NIC-301.
- The DDR and SRAM memories are characterized using generic highly configurable memory models.
- All relevant initiators (Cortex clusters, graphics processor, etc) are represented by traffic generators, in this case Generic File Reader Bus Masters (GFRBMs).
- All low-bandwidth peripherals (such as UARTs, Timers, etc.) that do not contribute to the workload on the interconnect and memory subsystem are omitted from the performance model.
All the required components for building such a trace-driven performance model are available in the Platform Architect model library. This way the platform can be created and configured with very little effort. The interconnect and memory subsystem are the main subject of the performance investigation and therefore represented as timed models. The relevant IP blocks are represented as trace-driven bus masters, where the accuracy is determined by the quality of the trace files. Therefore the definition of the workload model plays a key role in the setup of the performance analysis project.
Workload Model Definition
As discussed earlier, there are multiple ways to obtain a flexible trace-driven workload model. In this project we use the trace file generation utilities provided by Platform Architect. The idea is to generate the trace from a high-level traffic specification. The traffic scenario to mimic the product use-case is depicted in Figure 2.
|Figure 2: Workload Model Specification|
Each line in this table characterizes one traffic stream. There are two different types of traffic streams:
- The non-deterministic traffic of the Cortex CPU initiators is modeled using a random traffic generator.
- For the other initiators in the Mobile Platform SoC design example we use traffic generators that model more deterministic traffic as seen from data streaming initiators.
Both types of traffic streams have a set of common parameters, e.g. initiator name, address range, burst size and time range. The thread-ID enables the generation of multiple concurrent streams from the same initiator, e.g. the traffic of the graphics block is modeled as three independent streams. The remaining parameters are specific to the respective type of traffic, e.g. the random traffic is characterized by the probabilities for load and store transactions.
Using these kinds of synthetic trace generators is the easiest approach to obtain an executable workload model based on some assumption on the traffic profile of the respective components. Typically a variety of scenarios is generated to cover the workload situations of the most critical product use-cases. This is the most productive method for early architecture analysis. Later the synthetic traces can be replaced with more accurate traces or accurate models of the actual components.
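As an illustration of how a trace can be derived from such a high-level traffic specification, the following minimal sketch expands one stream description into a list of transactions. It is not the Platform Architect trace utility or its file format: the `Stream` fields, the `generate` function, and the tuple layout of a trace record are assumptions made for this example. A stream with a load probability stands for the random CPU traffic, a stream without one for the deterministic streaming traffic.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stream:
    """One line of a (hypothetical) traffic specification table."""
    initiator: str        # name of the bus master issuing the traffic
    thread_id: int        # separates concurrent streams of one initiator
    base: int             # first address of the target address range
    size: int             # length of the address range in bytes
    burst_bytes: int      # bytes transferred per transaction
    start_ns: int         # start of the active time range
    end_ns: int           # end of the active time range (exclusive)
    period_ns: int        # spacing between consecutive transactions
    p_load: Optional[float] = None  # load probability; None = streaming reads


def generate(stream, seed=0):
    """Expand a stream into (time, initiator, thread, op, address, bytes) records."""
    rng = random.Random(seed)          # seeded so that traces are reproducible
    slots = stream.size // stream.burst_bytes
    trace = []
    times = range(stream.start_ns, stream.end_ns, stream.period_ns)
    for i, t in enumerate(times):
        if stream.p_load is None:
            # Deterministic streaming: walk the range sequentially and wrap.
            addr = stream.base + (i % slots) * stream.burst_bytes
            op = "R"
        else:
            # Random traffic: burst-aligned random address, load or store.
            addr = stream.base + rng.randrange(slots) * stream.burst_bytes
            op = "R" if rng.random() < stream.p_load else "W"
        trace.append((t, stream.initiator, stream.thread_id, op, addr,
                      stream.burst_bytes))
    return trace


# Example: one streaming thread of a graphics block and one random CPU port.
gfx_trace = generate(Stream("gfx", 0, 0x1000, 256, 64, 0, 500, 100))
cpu_trace = generate(Stream("cpu0", 0, 0x8000, 1024, 32, 0, 1000, 10, p_load=0.7))
```

Several threads of the same initiator would simply be several `Stream` entries that share the initiator name and differ in `thread_id`.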
Looking at the Simulation Analysis Results
The interesting part begins after the model of the SoC platform and the application workload are available. We are now ready to execute the system performance model and to obtain a first round of analysis results. This allows us to evaluate how well the chosen interconnect and memory architecture serves the communication requirements of the given workload model. Typical metrics to evaluate performance and cost are transaction latency, throughput, and utilization of resources. Platform Architect presents the simulation results at 3 levels of detail:
- The most detailed view is a transaction trace, which shows the start and end of every transaction in the system, including all the intermediate phases. This view is suitable for debugging and investigation of specific situations. On the other hand, it is far too detailed to judge the overall performance.
- The most useful view for architecture optimization is the visualization of statistical performance metrics for latency, throughput, utilization, efficiency, contention, and number of outstanding transactions. These views show the dynamics of the activity over time and therefore allow the user to detect performance issues and to uncover cause-and-effect relationships. A head-to-head comparison of several different results in the same viewer is effective for root cause analysis, but does not scale to the dozens of simulation runs being compared during sensitivity analysis.
- For large-scale sensitivity analysis, Platform Architect aggregates the top-level metrics of any number of simulation runs into spreadsheets. This is the most productive way of finding the optimal design configuration, but it does not allow a detailed investigation of why the results are the way they are.
Each of the 3 levels of detail contributes to identifying the best candidate(s) or to determining how to further optimize the architecture. In the following we discuss the provided views and what observations and conclusions can be derived from them.
The bus analysis view shows a large variety of performance metrics, e.g. latency, throughput, utilization, efficiency, contention, and number of outstanding transactions. The example in Figure 3 shows the average read transaction duration aggregated into intervals, where the x-axis denotes time and the columns denote the bus initiators. The color-coding helps to quickly spot bottlenecks, in this case the transactions with the longest latency.
|Figure 3: Bus Analysis Results Identify Violations for Average Read Duration|
The total duration of more than 180 µs clearly violates our 120 µs performance goal stated on the right side of Figure 1. The color coding also highlights a red area for port 1 of CPU cluster 0. The measured 18 µs average read transaction duration during this period exceeds our constraint of 200 ns by far.
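The kind of aggregation behind such a view can be sketched in a few lines. The snippet below is only an illustration of the idea, not the tool's implementation: the record layout `(initiator, start_ns, end_ns)`, the function names, and the sample transactions are hypothetical. It averages the transaction duration per initiator and time interval and lists the cells that exceed a limit such as the 200 ns constraint.

```python
from collections import defaultdict


def average_duration_per_interval(transactions, interval_ns):
    """Average duration per (initiator, interval index).

    transactions: iterable of (initiator, start_ns, end_ns) records.
    A transaction is assigned to the interval in which it starts.
    """
    sums = defaultdict(lambda: [0, 0])   # key -> [total duration, count]
    for initiator, start_ns, end_ns in transactions:
        key = (initiator, start_ns // interval_ns)
        sums[key][0] += end_ns - start_ns
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def violations(averages, limit_ns):
    """Return the (initiator, interval) cells whose average exceeds the limit."""
    return sorted(key for key, avg in averages.items() if avg > limit_ns)


# Hypothetical sample records, chosen only to exercise the functions.
sample = [
    ("cpu0_port0", 0, 150),
    ("cpu0_port0", 400, 520),
    ("cpu0_port1", 100, 900),
    ("cpu0_port1", 1200, 19200),
]
averages = average_duration_per_interval(sample, interval_ns=1000)
hot_cells = violations(averages, limit_ns=200)
```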
By default the bus analysis view shows the end-to-end latency per initiator. You can also increase the level of detail and show how the transaction duration changes along the path. The screenshot shows the entire path from the “CORTEX_CL0.POST_1” initiator to the “PL341_0” target with all the intermediate nodes including the ASIB block, one High Performance Matrix (HPM) and the AMIB block. The transaction duration decreases along the path, which clearly indicates where the transactions are aging due to queuing. The short duration at the end of the path denotes the pure DDR2 memory latency, after the transactions have passed the arbitration at the output stage. As illustrated by this example, this view allows quick detection of performance issues and locates them in the interconnect topology.
The “bus resource statistics” below provides a different perspective on the analysis results. Here we focus on the communication resources in the interconnect architecture, whereas the previous view provides an initiator oriented perspective.
|Figure 4: Bus Analysis Results Determining the Root Cause|
The resource metric in the upper part of Figure 4 is showing the contention of the read address channel, which indicates the relative waiting time due to arbitration on output stages. During the critical period between 20 and 60 µs we can observe the following:
- The total utilization of output stage op_aix_m_1 is 156% and 126%, meaning during this interval between 1 and 2 initiators are waiting for access to the output stage.
- The contribution from port 1 of cluster 0 is 98% and 93%, so this initiator has to wait for arbitration almost all the time.
This is a first indication that contention is the root cause of the long latencies observed in the previous view.
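Values above 100% can look surprising at first. A minimal sketch of how such a number can arise, assuming that the contention of a resource is the sum of the relative waiting times of all initiators requesting it: the function name, the initiator names, and the waiting times of the two additional initiators are hypothetical, and only the 98% and 156% figures are taken from the discussion above.

```python
def contention(wait_ns_by_initiator, interval_ns):
    """Relative waiting time per initiator and summed over all initiators.

    A total of 1.56 means that, averaged over the interval, 1.56 initiators
    were waiting for arbitration at the same time.
    """
    per_initiator = {
        name: wait_ns / interval_ns
        for name, wait_ns in wait_ns_by_initiator.items()
    }
    return per_initiator, sum(per_initiator.values())


# Hypothetical 1000 ns interval: port 1 of cluster 0 waits 98% of the time,
# two other (made-up) initiators account for the remaining 58%.
per_initiator, total = contention(
    {"cpu0_port1": 980, "initiator_a": 330, "initiator_b": 250},
    interval_ns=1000,
)
```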
The lower part of Figure 4 shows the utilization of the read data channels of the respective AXI connections. The list of resources in the middle column is rather long, because each input stage, output stage, bridge, ASIB, and AMIB is listed as a dedicated resource. Now we can see that the utilization of memory data ports (at the top of the list) is fairly low. This indicates that the available memory bandwidth is not well used.
To summarize the detailed investigation of the performance analysis results: some of the initiators show excessively long transaction durations. These initiators also show high contention when trying to access shared resources. Despite the high demand for memory bandwidth, the utilization of the memory channels is rather low. Hence the root cause of the performance issues seems to be the poor utilization of the available memory bandwidth. A likely reason is the combination of long memory latencies (25 cycles) with few outstanding transactions (2), which throttles the effective throughput. This hypothesis can be confirmed in the transaction trace view.
The transaction trace shows the start and end of every transaction in the system, including all the intermediate phases. Figure 5 zooms into a trace of two simulation runs with different design configurations. The color coding indicates red for write transactions and green for read transactions. By default each transaction stream is represented as a single line. As shown here for the PL341 memory controller port, you can selectively increase the level of detail to visualize overlapping transactions and even individual transaction phases. The upper view shows the trace of the initial design configuration, where the number of outstanding transactions on the DDR2 memory interfaces is limited to two. Hence the memory can process only one read and only one write transaction at a time. Together with the long DDR2 memory latency, this explains the low utilization of the data channel: the next transaction can only start when the previous one has finished.
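A back-of-the-envelope estimate makes the throttling effect plausible. The sketch below is a simplified model, not a result of the simulation: it assumes that every transaction occupies one outstanding slot for the memory latency plus the data beats of its burst, and that the data channel carries at most one beat per cycle. The burst length of 4 beats and the function name are assumptions; the latency of 25 cycles is the figure quoted above.

```python
def data_channel_utilization_bound(outstanding, latency_cycles, beats_per_burst):
    """Upper bound on the fraction of cycles in which the data channel is busy.

    Each slot delivers beats_per_burst data beats every
    (latency_cycles + beats_per_burst) cycles; the channel cannot exceed 100%.
    """
    cycles_per_transaction = latency_cycles + beats_per_burst
    return min(1.0, outstanding * beats_per_burst / cycles_per_transaction)


# One read in flight versus four reads in flight, 25-cycle latency,
# assumed 4-beat bursts.
single = data_channel_utilization_bound(1, 25, 4)    # 4 / 29
fourfold = data_channel_utilization_bound(4, 25, 4)  # 16 / 29
```

Under these assumptions a single read in flight can keep the read data channel busy for at most about 14% of the cycles, while four reads in flight raise the bound to about 55%.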
To improve the utilization of the memory data channels we increase the number of outstanding transactions to eight. We re-run the simulation and open the new results as shown in the bottom of Figure 5.
|Figure 5: Transaction Trace Comparison of two simulations with 2 (upper) and 8 (lower) outstanding transactions on the memory|
The effect of the number of outstanding transactions becomes apparent. Now up to four concurrent read and write transactions are active on the memory interfaces. As a result, the data channels are much better utilized.
Doing a head-to-head comparison of detailed analysis views does not scale beyond a small number of simulation results. Large-scale comparison of many results is much more productive with sensitivity analysis using spreadsheets.
Platform Architect provides automation for sweeping design parameters across multiple simulation runs. As shown in Figure 6, the basic idea is to provide a spreadsheet defining the design parameter permutations, run the simulations, and consolidate the analysis results.
- The input scenario specifies the set of simulations with their respective design parameter settings. A scenario is specified in terms of a Comma Separated Value (CSV) table, where each row denotes a single simulation and each column denotes a design parameter value.
- Platform Architect is then able to execute the scenario file. The analysis results from each simulation are stored in a separate analysis database. Additionally, the top-level performance metrics are consolidated into a results table.
- The results table is similar to the scenario file, but with additional columns for the analysis metrics. The results table can be immediately converted into a Pivot Chart using common spreadsheet tools like Microsoft Excel or Star Office Calc.
In the following we discuss two exemplary Pivot Charts, which have been generated from a simulation sweep of our Mobile Platform example. This sweep comprises 108 simulations, where we modified the number of outstanding transactions on the DDR2 memory ports, the DDR2 memory frequency, and the number of outstanding transactions on the graphics and CPU initiators.
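A scenario table of this kind is straightforward to generate as the full permutation of the swept parameters. The following sketch uses only the Python standard library; the column names and the exact value sets are assumptions for illustration (they are chosen so that the product yields 108 rows, the article itself only names the swept parameters), and the real scenario file format expected by Platform Architect is not reproduced here.

```python
import csv
import io
import itertools

# Hypothetical parameter names and value sets: 4 * 3 * 3 * 3 = 108 rows.
parameters = {
    "mem_outstanding": [2, 4, 8, 16],
    "ddr2_mhz": [400, 533, 667],
    "gfx_outstanding": [2, 4, 8],
    "cpu_outstanding": [2, 4, 8],
}


def scenario_rows(parameters):
    """Yield one dict per simulation, covering every parameter combination."""
    names = list(parameters)
    for values in itertools.product(*parameters.values()):
        yield dict(zip(names, values))


def write_scenario(parameters, stream):
    """Write the scenario as CSV (one row per simulation); return the row count."""
    writer = csv.DictWriter(stream, fieldnames=list(parameters))
    writer.writeheader()
    rows = list(scenario_rows(parameters))
    writer.writerows(rows)
    return len(rows)


buffer = io.StringIO()          # stands in for a file opened with newline=""
count = write_scenario(parameters, buffer)
```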
The first Pivot Chart in Figure 7 shows the total runtime of the workload scenario on the y-axis depending on the number of outstanding transactions on the memory port on the x-axis and the DDR2 frequency as a parameter. Based on this perspective the following observations can be made:
- Having only 2 outstanding transactions on the memory ports clearly violates the performance goal for a total runtime of 120 µs. As we know from the previous discussion of the bus analysis and trace views, this is due to the low utilization of the memory data channels. All other configurations meet the runtime requirements.
- There is no significant difference between 8 and 16 outstanding transactions on the memory port. Increasing this design parameter impacts cost and power because of the necessary registers. Therefore the number of outstanding transactions should be as high as necessary, but as low as possible.
- Increasing the memory speed has no significant impact on the total runtime of the simulation scenario. This is because the speed advantage of increasing the memory frequency from 400 to 533 and 667 MHz is compensated by the increased CAS latency of 3, 4, and 5 cycles respectively.
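The last observation can be checked with simple arithmetic. The sketch below converts the CAS latency from cycles into nanoseconds for the three settings, under the assumption that the CAS cycles are counted at the quoted frequency; the function name is made up for this example.

```python
def cas_latency_ns(frequency_mhz, cas_cycles):
    """CAS latency in nanoseconds for a given clock frequency in MHz."""
    return cas_cycles * 1000.0 / frequency_mhz


# (frequency in MHz, CAS latency in cycles) as quoted in the text.
settings = [(400, 3), (533, 4), (667, 5)]
latencies = {mhz: cas_latency_ns(mhz, cycles) for mhz, cycles in settings}
```

All three settings come out at roughly 7.5 ns, so the absolute access latency stays nearly constant although the clock gets faster.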
Based on these observations we narrow the design space to those results, where the number of outstanding transactions on the memory port is set to 8 and the DDR2 frequency is set to 400 MHz.
Now consider the maximum transaction duration of the two initiator ports of the Cortex CPU cluster 0. On the x-axis we vary the number of outstanding transactions of the CPU initiators and as an additional parameter we vary the number of outstanding transactions of the graphics initiator. The 500 ns constraint on this metric is violated for port 1 when the number of outstanding transactions on the Cortex initiator is 8. In this case the effect of the GFX outstanding parameter is most significant, because the additional throughput further impacts the already suffering AXI1 port of the shown Cortex CPU cluster 0.
Interestingly we can observe an opposite trend for the two ports of this cluster, where increasing the number of outstanding transactions on the CPU initiators results in decreasing transaction duration for AXI0 and increasing transaction duration for AXI1. This is because AXI0 has a higher priority than AXI1. Hence AXI0 benefits from the deeper pipelining at the expense of AXI1. In other words, AXI0 is flooding the interconnect and starving the AXI1 port. This behavior can be changed by modifying the priority settings or by simply limiting the number of outstanding CPU transactions to 4.
Analyzing the average transaction duration of the initiator ports shows that allowing only 2 outstanding transactions on the CPU ports violates the average transaction constraint of 200 ns.
The cheapest interconnect configuration to satisfy all requirements of our case study is with 4 outstanding transactions from the CPU initiators, 2 outstanding transactions from the GFX block and 8 outstanding transactions on a DDR2-400 memory.
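The selection step on the consolidated results table can be sketched as a filter followed by a minimum search. In the snippet below only the three limits (120 µs total runtime, 200 ns average and 500 ns maximum transaction duration) come from the case study. The column names, the cost proxy, and above all the metric values in `rows` are invented placeholders for illustration, not measurements from the sweep.

```python
# Limits taken from the requirements discussed in the text.
CONSTRAINTS = {
    "runtime_us": 120.0,
    "avg_duration_ns": 200.0,
    "max_duration_ns": 500.0,
}


def meets_constraints(row, constraints=CONSTRAINTS):
    """True if every constrained metric of the row stays within its limit."""
    return all(row[metric] <= limit for metric, limit in constraints.items())


def cost_proxy(row):
    """Hypothetical cost measure: deeper queues need more registers."""
    return row["cpu_outstanding"] + row["gfx_outstanding"] + row["mem_outstanding"]


def cheapest_valid(rows):
    """Return the cheapest row that meets all constraints, or None."""
    valid = [row for row in rows if meets_constraints(row)]
    return min(valid, key=cost_proxy) if valid else None


# Placeholder rows with made-up metric values (NOT results of the case study).
rows = [
    {"cpu_outstanding": 2, "gfx_outstanding": 2, "mem_outstanding": 8,
     "runtime_us": 110.0, "avg_duration_ns": 230.0, "max_duration_ns": 400.0},
    {"cpu_outstanding": 4, "gfx_outstanding": 2, "mem_outstanding": 8,
     "runtime_us": 105.0, "avg_duration_ns": 150.0, "max_duration_ns": 450.0},
    {"cpu_outstanding": 8, "gfx_outstanding": 2, "mem_outstanding": 8,
     "runtime_us": 100.0, "avg_duration_ns": 140.0, "max_duration_ns": 620.0},
    {"cpu_outstanding": 4, "gfx_outstanding": 2, "mem_outstanding": 16,
     "runtime_us": 104.0, "avg_duration_ns": 148.0, "max_duration_ns": 440.0},
]
best = cheapest_valid(rows)
```

A real cost function would also weigh the memory frequency and the power impact; the sum of queue depths is used here only to keep the example short.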
Today’s complex designs rely on optimal architecture configuration to achieve the right balance between latency, throughput, cost, and power. This whitepaper presents a tool-assisted performance analysis flow for AMBA-based SoC designs. Using Synopsys Platform Architect enables the rapid creation of workload based performance models, which allow exploring hundreds of design alternatives using a very efficient “spreadsheet-in/spreadsheet-out” approach. The system architect can investigate performance issues and identify the potential for further optimization using the provided statistical analysis and trace views. This enables the optimization of system performance long before software becomes available. We have illustrated this flow based on the performance analysis of a typical multicore mobile SoC platform. Although this example is based on ARM CoreLink™ NIC 301, the Platform Architect model library also supports other interconnect architectures, including Synopsys DesignWare AXI and Arteris FlexNoC.
Tim Kogel received his diploma and PhD degree in electrical engineering with honors from Aachen University of Technology (RWTH), Aachen, Germany, in 1999 and 2005 respectively. He has authored a book and numerous technical and scientific publications on electronic system-level design of multi-processor system-on-chip platforms. Today, he is working as a Solution Architect at Synopsys Inc. In this position, he is responsible for the product definition and future direction of Synopsys' SystemC-based Platform Architect product line.