The End of a Necessary Evil: Collapsing the Memory Hierarchy

Four hours and 31 minutes.  That’s how much longer my HP Folio 1040 laptop estimates I can work given the energy stored in its battery.  Where does all that energy go? My CPU meter shows only slight activity. I’m only running a couple of apps and I’m not typing that fast.

It turns out that today’s computers spend most of their time and energy shuffling data between tiers of storage and memory. In modern systems, this hierarchy can be more than 10 layers deep. In geek land, we call this “the volatility chain.” We’re all so used to this that few of us ever think to question it. But on the face of it, this is an odd way to go about computing. Why not hold all your data in main memory, all of the time?

In this post, I want to take a look at how we came to work the way we do, and what comes next.

Why do we have a hierarchy?

It’s a question of scarcity. To keep up with the processor, you need the fastest memory possible. Since the 1970s, the fastest memories have required continuous power. Computers have always been built with as much fast memory as a user can afford, and the required capacity comes from cheaper, but slower, technologies. The memory hierarchy evolved because fast memory is expensive to both buy and run. A primary task of an operating system is to manage this hierarchy, delivering the right data to applications on demand and filing away the results.

Historically, this hierarchy has been effective. It was a brilliant way of achieving the price/performance combination users needed, and it has worked for decades. However, we believe we’re reaching a point where the memory hierarchy is holding us back. Today, scientists, mathematicians and economists are spending their careers working out how to perform their calculations instead of doing their actual work. They are forced to translate simple equations into complex parallel processing tasks because we can’t afford to buy, run or efficiently program computers with enough horsepower to do the job.

The top of the hierarchy—fast, but expensive and volatile

The memory hierarchy comprises three major layers: SRAM is used for on-chip cache memory, DRAM is used for main memory, and mass storage is provided by Flash and hard disk drives. Let’s start by looking at the first two layers:

  • The memory that shares silicon with the microprocessor is called SRAM. Each bit is stored in a network of (usually) six transistors. Speed is paramount, because the SRAM has to keep up with the gigahertz pace of the microprocessor. The problem is that SRAM cells can take up most of the space on the chip (the most expensive real estate on the planet!). They’re also the most difficult transistors to run reliably at low voltages and high frequencies, making them difficult and expensive to fabricate.
  • DRAM stores information as electric charge in a capacitor. The problem is that a DRAM capacitor is a leaky bucket of electrons. You have to keep refilling the capacitor every few tens of milliseconds or the data will be lost. This wastes both time (you can’t access data while a refresh is in progress) and power. As DRAM cells scale down, these twin problems get progressively worse.

At this point in the hierarchy, we cross an important boundary, the one between volatile and non-volatile memory. SRAM and DRAM remember only by continuously burning power, but some results of computation need to be recorded permanently. Of course, power loss is only one of the types of failure that we need to guard against, but we all know the consequences of losing power unexpectedly, so this boundary is an important one.

The bottom of the hierarchy—slow but cheap and permanent

        The final layer of our memory hierarchy retains information in the absence of power:

        • Flash is growing in popularity for at least part of mass storage today. It’s much quicker than a hard drive, but still very slow compared to DRAM. Flash is slow because data must be written and read (flashed) in large blocks. It’s like picking up a dictionary when you only want a single word. This speed limitation isn’t a problem today because of the memory hierarchy: we have SRAM and DRAM to do the rapid work. Flash also has a surprisingly low limit on how many times it can be erased and rewritten. SRAM and DRAM have effectively unlimited lifetimes, but Flash cells can break down after as few as 10,000 cycles. For this reason, Flash can’t be used for data-intensive main memory tasks. Dead cells are masked from the user by control circuitry that “walls off” damaged parts of the chip, but reliability is still an issue. A 2GB thumb drive, for example, actually contains more than 2GB of raw storage. The spare cells are held in reserve so that the drive can keep offering its full 2GB as cells wear out over the estimated lifetime of the device.
        • Hard disk drives are used for the bulk of mass storage today. Although Flash is catching up, hard drives still offer the lowest cost-per-bit this side of magnetic tape. But they’re glacially slow and energy-inefficient. They’re also inconsistent: if two blocks of data happen to be next to each other then it’s not too bad. But if the blocks are far apart, millions of clock cycles can be wasted while the read/write head drags itself across the platter. Newton still matters, and F=ma still means that moving drive heads and rotating platters burns energy.
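To put “millions of clock cycles” in perspective, here is a rough back-of-the-envelope sketch. The clock speed and the latency of each tier are illustrative assumptions chosen as round order-of-magnitude figures, not measurements from this post, and the function name is made up for the example.

```python
# Rough order-of-magnitude sketch. Every figure below is an assumption
# picked for illustration; none of them comes from the article.
CLOCK_HZ = 3.0e9  # assumed 3 GHz processor clock

ASSUMED_LATENCY_S = {
    "SRAM cache": 1e-9,    # assumed ~1 nanosecond
    "DRAM": 100e-9,        # assumed ~100 nanoseconds
    "Flash read": 100e-6,  # assumed ~100 microseconds
    "Disk seek": 10e-3,    # assumed ~10 milliseconds
}


def cycles_waited(latency_s, clock_hz=CLOCK_HZ):
    """Clock cycles that tick by while one access of this latency completes."""
    return round(latency_s * clock_hz)


for tier, latency_s in ASSUMED_LATENCY_S.items():
    print(f"{tier:<12} {cycles_waited(latency_s):>12,} cycles")
```

With these assumed numbers, a single long disk seek costs tens of millions of cycles, which is why so much effort goes into keeping the processor from ever having to wait for one.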

        Why haven’t I heard about this before?

        The industry has developed all kinds of tricks to mask the delays caused by the memory hierarchy. Most of them, such as caching and prefetching, use sophisticated algorithms to predict and deliver the data that will be needed next. Thousands of people have devoted their careers to improving these techniques, generating millions of lines of code in the process. But what happens to those tricks when, instead of working with the predictable behavior of a 1990s-era business database application, we start working on the social networking applications of 2014? Those tricks not only fail to speed things up, they can actually slow everything down.
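A toy simulation can illustrate why prediction-based tricks depend on predictable access patterns. The sketch below is not any real system’s cache: it is a minimal least-recently-used (LRU) cache, and the data set size, cache capacity, and the two synthetic workloads are all assumptions made up for the example.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Fraction of accesses served from a simple LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(accesses)


# All sizes below are hypothetical, chosen only to make the contrast visible.
rng = random.Random(0)
N_ITEMS = 100_000     # total distinct data items
CAPACITY = 1_000      # items that fit in the fast tier
N_ACCESSES = 50_000

# "Predictable" workload: 90% of accesses hit a small hot set of 500 items.
predictable = [
    rng.randrange(500) if rng.random() < 0.9 else rng.randrange(N_ITEMS)
    for _ in range(N_ACCESSES)
]
# "Unpredictable" workload: accesses scattered uniformly over everything.
scattered = [rng.randrange(N_ITEMS) for _ in range(N_ACCESSES)]

predictable_rate = lru_hit_rate(predictable, CAPACITY)
scattered_rate = lru_hit_rate(scattered, CAPACITY)
print(f"predictable workload hit rate: {predictable_rate:.1%}")
print(f"scattered workload hit rate:   {scattered_rate:.1%}")
```

The same cache that serves most requests for the hot-set workload misses almost every time once accesses are scattered. The simulation only shows the hit rate collapsing; it does not model the extra bookkeeping cost that makes a missing cache slower than no cache at all.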

        What is universal memory?

        If we could find a memory technology that is as fast and durable as DRAM, and as cheap as Flash and hard drives, we could combine—collapse—multiple layers of hierarchy. We call this combination of main memory and mass storage “universal memory.” Suddenly, the job of the operating system becomes massively simplified. Applications no longer have to be written to split large tasks into pieces so they can fit into a few gigabytes—or maybe a few terabytes for the small number of organizations that can afford it. With universal memory, petabyte-sized data sets can be held in memory to solve problems impossible to even attempt today.

        Universal memory remains a somewhat controversial topic. There are those who regard it as little better than a myth. After all, each of the technologies we use today excels at its assigned task. Wouldn’t forcing a compromise technology into use for multiple tiers result in a poorer experience for the end user?

        The answer is that we’re fast approaching an inflection point where the exponential increase in data—and the need to ingest and derive value from that data—will be beyond the capabilities of our current compute model. And even if we could extend the current model to cope, we wouldn’t be able to generate enough electricity to power it all.

        How is HP solving this problem?

        We believe that Memristor memory, which is being commercialized by HP, is the ideal universal memory vehicle. It’s fast, incredibly energy-efficient and can be packed extremely tightly on a chip.

        How fast is a Memristor? Our latest Memristors can store a bit of data in less than 100 picoseconds. Light itself would have travelled only a little over an inch (about 3 cm) in that time.
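The light-travel comparison is easy to check. This short sketch assumes light moving through a vacuum and uses the 100-picosecond figure quoted above; the variable names are ours.

```python
# Distance light covers in 100 picoseconds, assuming propagation in a vacuum.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
WRITE_TIME_S = 100e-12                # the 100 ps figure quoted above
METRES_PER_INCH = 0.0254              # exact, by definition of the inch

distance_m = SPEED_OF_LIGHT_M_PER_S * WRITE_TIME_S
distance_in = distance_m / METRES_PER_INCH

print(f"{distance_m * 100:.2f} cm ({distance_in:.2f} in)")  # prints "3.00 cm (1.18 in)"
```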

        This should all translate to a low cost-per-bit as we move from pilot production into high-volume fabrication.

        While many other “next-generation” memory technologies have been proposed—some of them are even in limited production today—each of them has a fatal flaw that prevents it from being useful as a universal memory.

        Universal memory, powered by Memristor, is at the core of HP Labs’ biggest research project: The Machine. Special-purpose processing cores connected to massive pools of universal memory via a high-bandwidth optical communications fabric are designed to deliver quantum leaps in performance and power efficiency.

        Universal memory is no longer a myth. It’s an imperative.

        Want to learn more?

        We’ve got some recommended viewing for those who want to learn more.

        Watch Martin Fink announce The Machine at HP Discover 2014:

        Learn how HP Labs is innovating the future of technology: