Rethinking SoC Architectures
Thursday, May 24th, 2012
By Ed Sperling and John Blyler
Virtualization and coherency, two concepts that can trace their origins back several decades, are suddenly gaining attention these days—but for entirely different reasons and uses.
A good way to think about virtualization is as an opportunistic use of available resources. Rather than waiting in a queue for a single processor core in a multicore SoC, for example, virtualization allows a compute task to take advantage of whatever processor is available if another is in use.
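The queueing idea above can be sketched in a few lines of Python. This is an illustrative toy, not a model of any real hypervisor or scheduler: the function name `makespan` and the task lengths are hypothetical, and it assumes every task can run on any core with no migration cost.

```python
import heapq

def makespan(task_durations, num_cores):
    """Return the finish time when each task takes whichever core frees up first."""
    # Each heap entry is the time at which one core next becomes free.
    free_at = [0] * num_cores
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)            # earliest available core
        heapq.heappush(free_at, start + duration)
    return max(free_at)

tasks = [4, 2, 3, 1]          # hypothetical task lengths, arbitrary time units
print(makespan(tasks, 1))     # 10: every task waits in one queue
print(makespan(tasks, 2))     # 5
print(makespan(tasks, 4))     # 4: no task waits for a busy core
```

With a single core the tasks finish back to back. Once a task can take any free core, the total time drops toward the length of the longest single task.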
The concept is hardly a new one. Virtualization was invented by IBM in the late 1960s as a way of running batch processing while still doing other work. But virtualization also creates a challenge for keeping caches in sync, which is why the concept of cache coherency was created. And as more cores are added into SoCs, rather than more processors within a single machine or on multiple machines, cache coherency has moved from mainframe to PC to processor and now across multiple processor cores.
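To make the cache-sync problem concrete, here is a minimal sketch of one textbook approach, write-through with invalidation. It is a simplified illustration only, not the protocol used by any processor or interconnect mentioned in this article, and the class names `Memory` and `Cache` and the `coherent` flag are hypothetical.

```python
class Memory:
    """Shared backing store that also tracks every cache attached to it."""
    def __init__(self):
        self.data = {}
        self.caches = []

class Cache:
    """Private per-core cache with a toy write-through, write-invalidate policy."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        memory.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:                 # miss: fetch from shared memory
            self.lines[addr] = self.memory.data.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value, coherent=True):
        self.lines[addr] = value
        self.memory.data[addr] = value             # write-through to memory
        if coherent:
            for other in self.memory.caches:
                if other is not self:
                    other.lines.pop(addr, None)    # invalidate stale copies

mem = Memory()
core_a, core_b = Cache(mem), Cache(mem)
core_a.write(0x10, 1)
print(core_b.read(0x10))                  # 1
core_a.write(0x10, 2, coherent=False)     # skip invalidation on purpose
print(core_b.read(0x10))                  # 1: core_b still holds a stale copy
core_a.write(0x10, 3)
print(core_b.read(0x10))                  # 3: stale copy was invalidated
```

The middle write shows what goes wrong without coherency: the second core keeps reading an old value even though memory has been updated.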
What’s changing now is that these concepts are spreading well beyond just the processors. Virtualization is being applied to memory, storage, I/O and graphics processing units (GPUs). But to make all of that work efficiently, coherency will have to grow well beyond just the cache, and that may prove to be a very difficult problem, particularly in a multivendor ecosystem.
The starting point
Much of this shift has come into focus inside large data centers over the past decade as a way of reducing costs. In the 1990s, the availability of inexpensive blade servers and ever-increasing density made it possible to begin replacing expensive mainframes and minicomputers with off-the-shelf commodity machines. You could stuff them into a single cabinet and blast in chilled air to cool them sufficiently. Two things happened to change this equation. The first was that electricity bills suddenly went up, because these cabinets were running hotter than ever before as density and current leakage increased. The second was that data centers had bought so many servers over a period of 15 years that just the cost of keeping them running was beginning to show up as seven-figure annual expenses for many large data centers.
Virtualization proved an effective way of reducing that cost because it allowed data centers to increase average server utilization from between 5% and 15% all the way up to 85% or more. That meant fewer servers overall, less electricity, less heat to remove, and far more available real estate. But with that problem now under control, at least for the moment, data center managers have shifted their attention to the exponential rise in the amount of data being stored. In the 1990s, most of the data was simply text or code. It is now a combination of text, video, data and voice, raising the same kinds of fiscal red flags about powering and cooling storage that servers raised prior to virtualization.
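A rough back-of-the-envelope calculation shows why those utilization numbers translate into fewer machines. The sketch below assumes identical servers and a workload that can be repacked freely, which real data centers only approximate; the function name `servers_needed` and the 100-server fleet are hypothetical.

```python
def servers_needed(current_servers, current_util_pct, target_util_pct):
    """Servers required to carry the same total work at a higher utilization."""
    # Total work, measured in server-percent units.
    work = current_servers * current_util_pct
    # Integer ceiling division: a partially used server still counts as one.
    return -(-work // target_util_pct)

for util in (5, 10, 15):
    print(util, servers_needed(100, util, 85))
# 5 6
# 10 12
# 15 18
```

Under these assumptions, a fleet of 100 servers running at 5% to 15% utilization could in principle be consolidated onto roughly 6 to 18 servers running at 85%.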
“What we’re seeing now is a move toward virtualized storage,” said Bob Pierce, flash business development group director at Cadence. “The next step is to merge storage and memory, which is why we’re seeing such strong interest in PCI Express. It’s a great transport vehicle. We’re also going to see back-end storage mixing with front-end storage data. What’s in between will be cache in the form of virtualized memory.”
This more fluid boundary between storage and cache has ramifications at all levels of design. It can affect everything from a processor to multiple processor cores on a single SoC, on multiple chips in a stacked die, and on multiple systems in a grid or mesh network.
“What’s happening is that you’re moving the back end closer to the processor,” said Pierce. “It changes the way big data and databases will be addressed in the future. If you have four CPUs, you can take them and, using PCI Express, prioritize them into a given drive sector and share them. That’s where all the VC startup money is these days. It’s the ability to configure servers for the function necessary at any given time. But you also have to virtualize storage and memory, and it has to be done dynamically.”
PCI Express has the added advantage of providing a single protocol to keep all of this data coherent. While it’s useful to store and retrieve data quickly, every copy also has to be updated to reflect any changes made in any part of the system.
Adding other resources
Mixing storage and cache is fairly obvious, though. Less obvious is the mixing of processing between CPUs and GPUs.
“In the past, GPUs were directly assigned to a virtual machine (VM),” said Sumit Gupta, senior director of Tesla GPU Computing at Nvidia. “Every VM would get a full GPU, which meant that each server was limited by the number of GPUs it could hold.”
Nvidia’s new Kepler GPU architecture uses more cores than its predecessors (192 vs. 32) and runs at a significantly lower frequency of 0.175GHz, compared with the old 1.35GHz. The result is faster processing with less power.
“We invented several technologies in order to virtualize the GPU, including an improved Memory Management Unit in the GPU,” he noted. “This is key because most of the data acted upon by the GPU comes from memory.”
But memory is being virtualized as well, making this whole scheme even more complicated. Startup Memoir Systems touts its solution, which it calls algorithmic memory, as an intelligent virtualization scheme for almost any available memory in a system. And there are moves afoot to do the same for the multiple I/O feeds to improve the speed of downloads and uploads from a system.
Making it all work together
While all of this virtualization makes sense from a performance standpoint, complex systems aren’t just about performance. Coherency is a critical piece, and it’s an extremely difficult one.
“The reality is that I/O coherency has been around for a long time in the x86 world,” said Laurent Moll, CTO at Arteris (and formerly a systems architect at both Nvidia and Broadcom). “The next frontier is when you start adding in other devices, and there’s a big disruption when you’re adding full coherency between the CPU and other things. It’s easiest when you have a small team designing the cache and all the protocols. When you start plugging multiple things together it gets a lot harder. You need to be a lot clearer about the specification, the verification and the tests that need to be run.”
He said there are two key challenges in this scheme. One is simply getting it right, which is difficult for multiple companies using different teams and with different cultures. “It’s very easy to have corner cases that the guys who wrote the spec didn’t think about,” he said. The second challenge is that there is no known path to do this. Quite simply, it has never been done before.
Conclusion
The upside of getting this right is a huge boost in performance. Being able to utilize more resources at any time can improve speed on almost every part of a chip or system, and virtualization plus coherency is a big win for the user.
The downside is that, assuming this can be done in the first place, it also could have an impact on power. The whole goal of most advanced SoC designs is to keep the majority of silicon dark except when it’s needed, and even then to run at maximum performance for a very short time to get everything done quickly. Having more resources to manage on an ad hoc basis solves the use model issue for performance, but it can wreak havoc on power management schemes.
In addition, it may require new software to even work in the first place. Cadence’s Pierce said some of this won’t even make sense on platforms such as Android until the multithreaded OS release called Ice Cream Sandwich becomes more prevalent.


