Early Optimization of Multicore SoC Architectures Using System-level Design Methods and NoC Interconnect Technology
Discovering system performance problems late in the development cycle can be catastrophic to product schedules and competitiveness. To predict the dynamic system performance of multicore system-on-chip (SoC) architectures and avoid the high cost of over-design, system-level simulation and performance analysis must begin earlier in the design cycle, before the application software is available. This is the best practice used by most successful SoC design teams.
Combining system-level design methods for architecture performance analysis with SystemC TLM models of network-on-chip (NoC) interconnect technology reduces product development risk and design cycle time. NoC interconnects can reduce the number of interconnect wires and the amount of logic required for multicore SoC designs. Fewer wires and logic gates ease routing congestion and timing closure at the back-end place-and-route stage. The result is shorter development cycles, higher SoC frequencies, and smaller SoC area and power, all of which contribute to higher product margins.
This article illustrates the best practices used by leading SoC architecture teams to quickly arrive at candidate architectures. We show how a SystemC TLM standards-based graphical environment is used to capture, configure, simulate and analyze the system-level performance of next-generation SoC architectures. The efficient turnaround time, system-level analysis views and available models enable more realistic system simulation and early optimization of multicore SoC architectures using NoC interconnect technology. A mobile device example is used to illustrate this approach.
|Figure 1: Mobile Platform Performance Model Block Diagram Featuring Arteris FlexNoC Network-on-Chip Interconnect|
Mobile Device Case Study
The mobile device design comprises two ARM Cortex-A9 processor clusters, a graphics processor, several additional initiators, two memory controllers and additional memories. At the heart of the system is the Arteris FlexNoC Network-on-Chip Interconnect. Figure 1 depicts our performance model.
The interconnect fabric is represented by a SystemC TLM-2.0 architecture model of an Arteris FlexNoC interconnect. All relevant initiators (Cortex clusters, GPU, etc.) are represented by Generic File Reader Bus Masters (GFRBMs) from the Synopsys TLM library. The memories are also from the TLM library. Low-bandwidth peripherals (such as UARTs, timers, etc.) that do not contribute to the workload placed on the interconnect and memory subsystem can be omitted from the performance model. This way the entire performance model can be rapidly constructed based on available library elements.
Define Your Project Clearly - And Go With The Flow!
Latency, throughput, quality of service, cost and power – to achieve the best architecture, the requirements of your multicore SoC platform must be understood within the context of the overall system. By choosing a traffic scenario to emulate the effect of dynamic application workloads, you can enable realistic simulation before software is available. For our example mobile device, we will focus on the graphics processing of high-speed incoming data from the Internet while handling general OS tasks.
Our overall performance goal is to finish the execution of this representative traffic use case in less than 100 microseconds (µs). In addition, two performance constraints are defined for Cortex transaction latencies: the average duration must be below 100 nanoseconds (ns), and the maximum duration must be below 300 ns. Lastly, the optimal configuration must minimize cost by reducing the number of outstanding transactions where possible, thus saving register hardware, and by choosing a memory speed that avoids overdesign. With clear project goals, we’re ready to go with the flow in Figure 2.
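The goals above can be written down as explicit pass/fail checks before any simulation is run. The following is a minimal Python sketch; the names `RunMetrics` and `violated_goals` are hypothetical and are not part of Platform Architect or FlexNoC. Only the three thresholds come from the text.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the project goals stated in the text.
MAX_TOTAL_RUNTIME_US = 100.0    # the whole use case must finish in under 100 us
MAX_AVG_CORTEX_NS = 100.0       # average Cortex transaction duration
MAX_PEAK_CORTEX_NS = 300.0      # maximum Cortex transaction duration


@dataclass(frozen=True)
class RunMetrics:
    """Metrics of one simulation run (hypothetical container, not a tool API)."""
    total_runtime_us: float
    avg_cortex_ns: float
    max_cortex_ns: float


def violated_goals(m: RunMetrics) -> List[str]:
    """Return the names of the goals that a run fails to meet."""
    failures = []
    if m.total_runtime_us >= MAX_TOTAL_RUNTIME_US:
        failures.append("total_runtime")
    if m.avg_cortex_ns >= MAX_AVG_CORTEX_NS:
        failures.append("avg_cortex_latency")
    if m.max_cortex_ns >= MAX_PEAK_CORTEX_NS:
        failures.append("max_cortex_latency")
    return failures
```

A run is a candidate only when `violated_goals` returns an empty list; cost is then used to choose among the candidates.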
|Figure 2: System-Level Design Flow for SoC Performance Analysis And Optimization Featuring Synopsys Platform Architect|
Step 1: NoC Specification in Arteris FlexNoC
First we define the SoC Network-on-Chip interconnect topology in the Arteris FlexNoC tool. We specify the external interfaces, three different clock domains, the initiator and target sockets with their clock frequencies and protocols, and the memory map. In our example we have a full matrix in the request and in the response path. Then we define the performance-related parameters, like the number of outstanding transactions on each network interface unit. Based on this information the FlexNoC tool generates the interconnect structure shown in Figure 3, including the detailed hardware components inside the NoC.
|Figure 3: Full Matrix NoC Topology with Multiple Clock Domains|
FlexNoC also generates SystemC models of the NoC configuration at different levels of abstraction. These SystemC models have TLM-2.0 compliant interfaces and are easily imported into Platform Architect. The different levels of abstraction are packaged into one block, so one can easily select the level desired. The SystemC importer automatically detects the interfaces for clock, reset, initiators and targets of the NoC model. The tunable parameters in the generated SystemC model can be configured from within Synopsys Platform Architect. This way, we can sweep design parameters and analyze their effect on the system performance without generating a new interconnect model.
Step 2: System Configuration and Simulation in Synopsys Platform Architect
The initiators in our example are the Generic File Reader Bus Masters from the Synopsys TLM model library, referred to below as FRBMs. They have TLM-2.0 approximately timed interfaces and connect directly to the FlexNoC model. The models for the DDR and SRAM memories on the target side are also from the TLM model library.
Our traffic scenario is described in tabular format, where we specify the transaction streams for each initiator. From this table the tool generates the trace files to be executed by the FRBMs. Prior to running the simulation, we assemble the system, define the memory map and select the analysis views we want to enable, such as the detailed tracing and statistical analysis views in Platform Architect. We can also enable VCD tracing inside the FlexNoC model, which provides additional insight into the NoC internals. For system configuration, we can set the number of outstanding transactions on the initiator and target sides to 2, 4, or 8. DDR latency can be set to 20 or 40 cycles. SRAM latency can be set to 5 or 10 cycles. After initial simulation we analyze the recorded results.
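To make the step from traffic table to trace file concrete, here is a minimal Python sketch that expands one table row into a sequence of trace lines. The `Stream` fields and the line layout (`kind address bytes gap`) are assumptions made for illustration; they are not the actual Platform Architect table columns or the trace format read by the FRBMs.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Stream:
    """One row of a traffic table (hypothetical fields)."""
    initiator: str    # name of the initiator that replays the stream
    kind: str         # "R" for reads, "W" for writes
    base_addr: int    # address of the first transaction
    burst_bytes: int  # bytes moved per transaction
    count: int        # number of transactions in the stream
    gap_cycles: int   # idle cycles between consecutive requests


def trace_lines(stream: Stream) -> Iterator[str]:
    """Yield one text line per transaction, walking linearly through memory."""
    for i in range(stream.count):
        addr = stream.base_addr + i * stream.burst_bytes
        yield f"{stream.kind} 0x{addr:08X} {stream.burst_bytes} {stream.gap_cycles}"
```

Describing traffic as a few parameterized streams per initiator, rather than as hand-written traces, is what keeps the scenario easy to change between simulation runs.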
|Figure 4: Performance Model Traffic Table and SystemC Block Diagram|
Step 3: Root Cause Performance Analysis
The detailed transaction trace shows every single transaction on every port in the system. Zooming in, we can identify individual read and write transactions with their request and response phases. At this level of detail, we can observe that there is little pipelining because we initially limited the number of outstanding transactions to 2.
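A back-of-the-envelope bound shows why a small outstanding-transaction limit restricts pipelining. By Little's law, an initiator that may have at most N transactions in flight, each taking L cycles from request to response, cannot sustain more than N / L transactions per cycle. The Python sketch below assumes a fixed round-trip latency and a port that can issue at most one request per cycle. It is a simplification for intuition, not a model of FlexNoC behavior, and the function name is hypothetical.

```python
def max_transactions_per_cycle(outstanding: int, latency_cycles: int,
                               issue_limit: float = 1.0) -> float:
    """Upper bound on sustained transaction rate from Little's law.

    outstanding    -- maximum number of transactions in flight
    latency_cycles -- assumed fixed request-to-response latency
    issue_limit    -- assumed cap on requests issued per cycle
    """
    if outstanding <= 0 or latency_cycles <= 0:
        raise ValueError("outstanding and latency_cycles must be positive")
    return min(issue_limit, outstanding / latency_cycles)
```

With a 40-cycle latency, for example, a limit of 2 caps the rate at 2 / 40 = 0.05 transactions per cycle, while a limit of 8 raises the cap to 0.2.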
The bus analysis view shows the statistical aggregation of the simulation results. We can view the number of read transactions for each initiator aggregated over time and split into intervals. Very quickly we observe that the total runtime of the entire scenario is more than 260 microseconds, well over the 100-microsecond goal. Looking at the transaction duration, we see that the average read duration of the DMA is 450 ns. The maximum read duration for port 1 of Cortex cluster 1 climbs as high as 1200 ns. We also see that the utilization of the memory interface (request channel) is very poor.
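The kind of aggregation behind such a view can be sketched in a few lines of Python. The record layout `(initiator, kind, start_ns, end_ns)` is an assumption made for illustration and is not the trace format recorded by Platform Architect.

```python
from collections import defaultdict
from statistics import mean


def read_duration_stats(transactions):
    """Average and maximum read duration per initiator.

    transactions -- iterable of (initiator, kind, start_ns, end_ns) tuples,
                    where kind is "R" or "W" (hypothetical record layout)
    """
    durations = defaultdict(list)
    for initiator, kind, start_ns, end_ns in transactions:
        if kind == "R":
            durations[initiator].append(end_ns - start_ns)
    return {
        name: {"avg_ns": mean(values), "max_ns": max(values)}
        for name, values in durations.items()
    }
```

The average shows whether an initiator is starved in general, while the maximum exposes the worst-case stalls that the 300 ns constraint is meant to catch.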
To gain additional insight, we load the recorded VCD trace from the FlexNoC model back into the Arteris tool. To improve the utilization of the memory interface we need to increase the number of outstanding transactions. We do this with a global parameter that is propagated to the individual parameters in the FlexNoC model. After re-running the simulation, we can compare the new results with the previous configuration. In this way we iteratively improve the performance of the system.
|Figure 5: Comparison of Statistical Bus Analysis Results|
Step 4: Sensitivity Analysis for Optimization of Performance, Cost and Power
In the last part of our multicore performance optimization project, we analyze the results of a parameter sweep using Microsoft Excel. The full set of design parameters and their variations spans 36 Platform Architect simulations (three settings each for the initiator-side and target-side outstanding transactions, and two each for DDR and SRAM latency). Each simulation has its own set of performance analysis results for our metrics: the total runtime of the simulation, the average read and write durations, and the maximum read and write durations for all the initiators. Results are extracted into Excel, where we can use pivot charts to analyze the impact of the design parameters on the metrics.
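The sweep is a plain Cartesian product of the parameter values listed in Step 2. A minimal Python sketch follows; the dictionary keys are hypothetical names, not Platform Architect parameter identifiers.

```python
from itertools import product

# Parameter values as given in the system configuration step.
INITIATOR_OUTSTANDING = (2, 4, 8)
TARGET_OUTSTANDING = (2, 4, 8)
DDR_LATENCY_CYCLES = (20, 40)
SRAM_LATENCY_CYCLES = (5, 10)


def sweep_configurations():
    """Return every combination of the four swept parameters."""
    return [
        {
            "initiator_outstanding": init,
            "target_outstanding": tgt,
            "ddr_latency_cycles": ddr,
            "sram_latency_cycles": sram,
        }
        for init, tgt, ddr, sram in product(
            INITIATOR_OUTSTANDING,
            TARGET_OUTSTANDING,
            DDR_LATENCY_CYCLES,
            SRAM_LATENCY_CYCLES,
        )
    ]
```

The list has 3 x 3 x 2 x 2 = 36 entries, one per simulation in the sweep.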
|Figure 6: Sensitivity Analysis Using Pivot Charts|
We first identify which parameter configurations violate our constraint that the total runtime stay below 100 microseconds. These are the ones extending above the line in Figure 6, and we exclude them from our analysis. We also exclude all the simulations where the number of outstanding transactions on the target side is 2, because these also violate the constraint.
We next exclude the columns where the number of outstanding transactions on the initiator side is 2. We then examine other design constraints such as the average read duration, which must be below 100 nanoseconds. We see that every time the DDR latency is 40 cycles, this constraint is violated. We eliminate these from our solution set.
For the remaining configurations, we check if any others violate our constraints. There is one more configuration, 8-20-5 (meaning 8 outstanding transactions on the initiator side, a DDR latency of 20 cycles and an SRAM latency of 5 cycles), that exceeds the maximum read duration of 300 nanoseconds and is excluded.
And the Winner Is…
All the remaining configurations are valid. We now select the configuration that is least expensive to implement, which means the one with the fewest outstanding transactions. FlexNoC helps us in this analysis by providing NoC area estimates for any interconnect configuration generated from the FlexNoC tool. The 4-20-10 configuration is the optimal solution, using only 4 outstanding transactions and the slower, less expensive memory.
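The elimination and selection steps can be replayed as a small Python sketch. It uses the same labels as the text (initiator-side outstanding transactions, DDR latency, SRAM latency) and leaves out the target-side setting, as the labels do. The exclusion rules simply restate the findings reported above rather than recomputing them from simulation data, and the cost ordering (fewer outstanding transactions first, then slower memory) is an assumption standing in for the area estimates a real flow would use.

```python
from itertools import product


def surviving_labels():
    """Apply the exclusions reported in the text to all label combinations."""
    labels = product((2, 4, 8), (20, 40), (5, 10))
    return [
        (init, ddr, sram)
        for init, ddr, sram in labels
        if init != 2                          # initiator-side limit of 2 was excluded
        and ddr != 40                         # violated the average read duration goal
        and (init, ddr, sram) != (8, 20, 5)   # exceeded the 300 ns maximum read duration
    ]


def cheapest(configs):
    """Pick the lowest-cost label under the assumed cost ordering.

    Fewer outstanding transactions is cheaper; for equal counts, higher
    (slower) memory latency is treated as cheaper.
    """
    return min(configs, key=lambda c: (c[0], -c[1], -c[2]))
```

Calling `cheapest(surviving_labels())` returns `(4, 20, 10)`, the configuration chosen in the text.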
|Figure 7: Optimized Configuration for Performance and Cost|
What we have described above is a design flow that leading SoC architecture teams use today to quickly arrive at candidate SoC architectures early in the design process. In this example our metrics were primarily performance-related, but in the real world additional metrics such as power consumption, interconnect area and whether low-VT cells are required to meet design frequencies are also considered. Lessons learned from these successful teams include:
- Define project goals clearly
- Focus on critical use cases that must be met to achieve system performance, power, and cost
- Assemble the performance model based on these requirements, and no more
- Start simulation earlier using realistic traffic scenarios, before software is available
- Avoid guesswork. Use realistic system simulation, quantifiable results, root cause analysis and sensitivity analysis to understand tradeoffs and make confident decisions that avoid overdesign
In this article, we showed how one can use system-level design methods at the earliest stages of SoC design to determine the optimal interconnect configuration that meets a set of performance goals and system constraints. With a system-level environment like Synopsys Platform Architect and configurable SoC interconnect IP like Arteris FlexNoC, it is possible to quickly create and run multiple simulations that explore and optimize possible configurations.
Kurt Shuler is the vice president of marketing at Arteris. Prior to Arteris, Kurt Shuler held senior marketing and product management roles at Intel, Texas Instruments, ARC International and two startups, Virtio and Tenison. He has extensive IP, semiconductor and software marketing experience in the mobile, consumer electronics and enterprise server markets. Before working in high technology, Kurt flew as an air commando in the U.S. Air Force Special Operations Forces. Mr. Shuler earned a B.S. in Aeronautical Engineering from the United States Air Force Academy and an M.B.A. from the MIT Sloan School of Management.
Patrick Sheridan is responsible for Synopsys' system-level solution for multicore platform architecture design. In addition to his responsibilities at Synopsys, from 2005 through 2011 he served as the Executive Director of the Open SystemC Initiative (now part of the Accellera Systems Initiative). Mr. Sheridan has 28 years of experience in the marketing and business development of high technology hardware and software products. Prior to joining Synopsys he worked at CoWare, Hewlett-Packard, Cadence Design Systems, and provided marketing consulting to successful start-up companies in Silicon Valley. Mr. Sheridan has a BS in Computer Engineering from Iowa State University.