By Ed Sperling
As complexity increases and the industry increasingly shifts away from ASICs to SoCs, the concept of coherency is beginning to look more like a stack of issues than a discrete piece of the design.
There are at least five levels of coherency that need to be considered already, with more likely to surface as stacked die become mainstream over the next few years. Perhaps even more mind-numbing, this stack itself will have to take on a level of coherency over the next couple of generations of chips.
Let’s take a closer look.
The concept of keeping data coherent was historically the province of processor makers such as IBM, Intel and AMD, which have focused on improving performance through faster access to data. One approach to improving performance has been multithreading and multiprocessing. Along with that, these vendors have added various levels of cache memory for faster recall of important data.
More cores also make it harder to use these caches effectively. Data has to be kept consistent, which requires more system overhead in terms of processing and power just to maintain that coherency. And it gets even harder as more cores are added into an SoC, because those cores increasingly are not the same size, do not run at the same frequency, and sometimes do not even connect directly to the main CPU.
“With cache coherency, some of the traffic may be serviced by the cache on another GPU,” said Drew Wingard, CTO at Sonics. “If you’re just using an ARM core, the CPU coherence is sufficient. But the GPU uses its own local memory. You really want it to be fully cache coherent across all of those.”
But even finding the data to maintain consistency may be a problem in a complex SoC.
“You can view what’s in memory, or view it and be able to change what’s in memory, but first you have to find it,” said Kurt Shuler, vice president of marketing at Arteris. “If you have four cores, the most efficient way to hook them up is for each core to have its own cache and graphics to have its own cache. If you change something, you have to snoop in all the caches to make sure it’s consistent.”
But there is also a move in the completely opposite direction—sharing memories among multiple cores—because it reduces the number of components on the bill of materials. The Low-Latency Interface specification from the MIPI Alliance is a case in point, where a memory can be shared between a modem and an applications processor. Intel, meanwhile, has added on-chip graphics that share memory with the CPU.
“The whole design gets more complex,” said Shuler. “You have more traffic beyond the cores, and from a power standpoint the overhead goes up.”
Still, cache coherency is one of the better-understood pieces of this stack. It has been an issue ever since multiprocessing was first employed in the 1960s. “Snooping” has been widely used since that time.
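The snooping idea Shuler describes can be sketched in a few lines of code. This is an illustrative simplification with hypothetical class names, not any vendor's actual protocol: each core has a private cache, and a write by one core is broadcast on a shared bus so the other caches can invalidate their stale copies.

```python
# Minimal sketch of snoop-based invalidation (hypothetical, simplified):
# each core has a private write-through cache; a write by one core is
# snooped by the others, which drop their stale copies.

class SnoopBus:
    def __init__(self):
        self.caches = []
        self.memory = {}          # backing memory: address -> value

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_invalidate(self, addr, origin):
        # Every other cache snoops the write and invalidates its copy.
        for c in self.caches:
            if c is not origin:
                c.lines.pop(addr, None)

class SnoopingCache:
    def __init__(self, bus):
        self.lines = {}           # cached lines: address -> value
        self.bus = bus
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:                  # miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.memory[addr] = value               # write-through for simplicity
        self.bus.broadcast_invalidate(addr, origin=self)

bus = SnoopBus()
cpu, gpu = SnoopingCache(bus), SnoopingCache(bus)
cpu.write(0x40, 1)
gpu.read(0x40)           # gpu now holds a cached copy of 1
cpu.write(0x40, 2)       # snoop invalidates gpu's stale copy
print(gpu.read(0x40))    # gpu re-fetches and sees 2, not the stale 1
```

The cost Shuler alludes to is visible even here: every write generates bus traffic that every other cache must process, which is exactly the overhead that grows with core count.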
A newer facet of coherency involves embedded software. Because SoCs now include an increasing amount of software, engineering teams have to wrestle with coherency issues that previously were handled by the operating system.
“Fundamentally you’ve got two combined issues here,” said Andy Meyer, verification architect for Mentor Graphics’ Design Verification Technology Division. “You’ve got cache coherency, where the same data is being viewed in a couple places. And then you’ve got an issue with consistency in the simple code in a uniprocessor that now has to run on a second processor. The ordering of events can change in multiprocessing.”
Those problems crop up regularly in verification, but not always with the expected results. It’s difficult to effectively write the stimulus in a testbench for coherency. What happens, for example, when a core is shut down to save power?
“The scariest part is when there is no OS support,” said Meyer. “There’s also a big problem with heterogeneous cache, such as when you have a CPU working with a GPU.”
Another issue has to do with effective coverage in verification, already a problem for complex SoCs. States frequently are distributed across multiple chips and multiple boards. Timing varies from one state to another, and can be particularly problematic if snooping functions are tied to a state. And parallelism continues to baffle even the most advanced teams.
“Standard coverage methods don’t work well here,” said Meyer. “You have to query in ways you traditionally didn’t have the power to query and ask questions across months of regressions. For instance, ‘Have we been here ever—or in the last two months.’ Until coverage steps up, people with deep knowledge of verification running hundreds of full-time emulator systems are finding out at the last minute that it’s not okay to ship.”
Tied in with both cache coherency and software coherency is I/O coherency. Increased communication on a chip, between chips, and between a chip and the outside world has turned what used to be a relatively straightforward networking issue into a complex jumble of prioritization and synchronization.
“You have to deal with this even in single processors,” said Sonics’ Wingard. “You may have a PCI core streaming data into memory. Today, without I/O coherence, it’s difficult to determine what is coming in. The CPU has no way of knowing what was transferred when it does a copy from non-cache to cache.”
He noted that personal computers had I/O coherency for a long time, particularly with direct memory access. DMA was developed initially to help solve the bottleneck that occurred when a CPU was involved in an I/O transfer. Rather than tie up the CPU with that transfer, the CPU continued running, then accepted an interrupt when the transfer was completed.
But with more of this being moved onto a chip, keeping coherency while moving data back and forth from more places is becoming much more difficult.
One of the least-addressed facets of the coherency stack involves business and communication issues across the supply chain for a particular SoC, rather than the technology itself. Even where competitive suspicions can be overcome, the very different approaches taken for designing components, IP and software, as well as language barriers, create one of the more difficult and less tangible challenges in the coherency stack.
“The challenge going forward is that you have a bunch of people who may not be that skilled in system development driving the chip and spec for one design, and other suppliers trying to orchestrate things,” said Mike Gianfagna, vice president of marketing at Atrenta. “So you bring them together to solve a problem for one customer in 12 weeks and then they move on. You’ve got corporations coming together and bringing all these pieces together almost like the way a movie is done. But is there a coherent way to communicate data and information risks and still provide good visibility from a power/performance/area point of view?”
For decades this task has been handled by IDMs, but in the SoC world there are far fewer IDMs these days. Many of these chips are built using third-party IP such as cores from ARM or MIPS, DSPs from companies such as Tensilica, and standard IP from the Big Three EDA vendors.
Coherency in stacked die
It’s uncertain whether stacking of die, in either 2.5D or 3D configurations, will make coherency easier or harder. The answer is likely to be a little of both.
“With 2.5D and 3D, you’re looking at low-power memory access,” said Arteris’ Shuler. “You put the DRAM closer to the CPU, the addressing is wider and you get rid of some of the latency. But you also need coherency across all of this.”
No one is sure yet how multiple high-speed communication channels between die will affect coherency. If the channel between die is wider and shorter, that will improve data speed. But if processors and DRAM are scattered across multiple die, with some of them shut down, some partially shut down and others fully active, it may become harder to keep track of data and make sure it is all synchronized.