Demystifying Memory Sub-systems Part1: Caches

Demystifying Memory Sub-systems Part1: Caches

Introduction

In this and the next article I want to cover the topic of memory sub-systems, discussing subjects such as caches, memory protection, virtual memory, and memory-management systems. It is my experience over the years that some engineers seem to think these topics are too difficult and specialist to understand—even ‘magic’. This depends where one finds oneself working, of course, and I have worked in environments where these functions need to be implemented and expertise is at hand. So in these articles I want to demystify these topics and show that, even if this is not an area of expertise or experience it is not really a difficult subject if approached at the right pace and appreciate the context of what problems are being solved by these systems. I also think that many logic and embedded software engineers will at least be working within a processor based SoC system with these type of memory sub-systems, such as an ARM or RISC-V based SoC and knowing how these systems work is extremely helpful in understanding how to use them effectively, and to debug issues when constructing a solution based on them. This first article, though, will concentrate on caches, what types are common and why they are employed.

A History Lesson

In 1981, at the early stages of the personal computer revolution, several PC computers were launched; the Sinclair ZX81, The BBC Micro, the Commodore 64 and the IBM PC, amongst others. These systems had certain system attributes in common, such as an 8-bit microprocessor, RAM, ROM and peripherals for keyboard, sound, graphics, tape/disk interface and programmable input/output. The operating system might be stored on the ROM and programs loaded into RAM. When the computer started running it would read instructions over its memory mapped bus directly from ROM and/or RAM. Memory was expensive and these early systems were limited to tens of kilobytes (or less—the ZX81 was shipped with 1Kbyte). The dynamic random-access memory (DRAM) used at the time had comparable access times to the CPU clock rate and the CPU was not stalled on instruction accesses. In some cases, where video RAM was shared with the CPU it might even be faster. The BBC micro, for example, had a 2MHz CPU, with RAM running at 4MHz, interleaving video and CPU access, without stalling either.

As time progressed and CPUs became faster, accesses to memory could not keep up, particularly with the introduction of synchronous DRAM (SDRAM). A CPU system can function perfectly well with the slower memory, but each load or store between the CPU and memory introduces wait states that slows down the effective performance of the CPU. The solution was to introduce cache memory. That is a faster memory that contains fragments of the code and data currently being used or, more precisely, likely to be used. Initially caches were external fast RAM, such as SRAM, but were quickly integrated into the microprocessors themselves as on chip memory.

Cache Requirements

The idea behind the use of a cache makes some assumptions about the way programs tend to execute. Most programs have a linear flow for more than just a few instructions before it might branch or jump, and also mostly works on local data within those fragments of code, say on a section of memory on the stack. If a program did truly jump around at random memory locations, and access random areas of data memory a cache would not be an advantage—actually, it would slow it down.

Given this, a cache design would need to access sections of instructions or data from main memory to place into the cache memory. This is actually efficient from an SDRAM access point of view as well. Since, in modern CPU systems, multiple processes are running, it will need to keep several disparate fragments in the cache, and know which memory fragments these are (i.e. their address) and also, given the finite size of cache memory that will be available, it will need some mechanism for writing back updated fragments to main memory, in the correct place, so it can use that part of the cache memory for a new fragment, as needed.

As we shall see, the way caches work means that even they may take multiple CPU clock cycles to access, though less than accesses to main memory. In ARM (and I assume other) systems there may be some tightly coupled memory (TCM) which would be single cycle memory, holding important, time critical, code that needs fast execution, such as initial interrupt routines or special operating system code and data. TCM, however, is not to be confused with cache memory. TCM would sit at some fixed place in the memory map and would not be cached.

Caches can also multi-layered. The CPU may access memory via a cache (let’s say level 1) which is very fast, but this limits the amount that can be deployed. There is a (rough) inverse relationship between RAM size and speed. So many systems will have another layer of cache (layer 2) underneath layer 1, which is larger but slower—though still faster than direct access to main memory. If data is not in the L1 cache, L2 is inspected and L1 updated if present, else main memory is accessed. System can even have a third layer (L3).

Many modern processors employ a Harvard architecture. That is, there are separate busses for instructions and data. Many systems reflect this architecture by having separate instruction and data level 1 caches, though level 2 would be common. Below is a block diagram of the Xilinx Zynq-7000 SoC FPGA.

No alt text provided for this image

Highlighted in yellow, at the centre of the diagram, are the blocks for the ARM Cortex-A9 processors, indicating 32Kbyte instruction and data L1 caches, with a common 512 Kbyte L2 caches. This is mirrored in other similar devices, such as the Intel Cyclone V FPGA and gives an idea of a typical SoC setup.

So, after this overview, let’s move on to how a cache is constructed.

Cache Terminology

There is a lot of potentially confusing terminology surrounding caches, but it is all for good reason. We need to define this terminology so that we use a common, understood language.

A cache that stored individual bytes, or even words of the basic architecture size (e.g., 32 bits) as the size of fragment mention before, would not work and does not fit with the assumption about instructions and data being accessed from locally contiguous memory sections. What the size of these fragments needs to be is part of the design of the cache and is a trade-off between efficiency when accessing a block from main memory, the cache size, and the likelihood of reading values that are not accessed before the data is removed from the cache once more. This fragment of the chosen size is called a cache line. Typical sizes might range from 8 to 256 bytes for an L1 cache (in powers of 2). The cache line fragments would be aligned in memory to its size so that, for example, 32-byte cache line fragments will have start addresses at multiples of 32 bytes—0, 32, 64 etc.

Obviously, a single cache line in memory would not be very efficient, as it would quickly need to be replaced with a new line, with all the delay that that would cause. So, the cache has a number of cache line spaces, and the collection of cache-lines is called a set. A typical set size might range from 128 to 16K cache-lines.

A cache may also have more than one set. This is known as the number of ways and might range from 1 to 4. The total size of the cache, in bytes, is just these three numbers multiplied together. For example, if a cache-line size is 32 bytes, with a set size of 128 and is a 2-way cache, then this is 32 × 128 × 2 = 8Kbyte cache. Below is a table showing some real-world processor systems and their level 1 cache parameters (some of which are configurable):

No alt text provided for this image

Associativity

Another new term (unfortunately) that defines how a cache will function—associativity. Basically there are two types of associativity

  • Fully associative
  • Set associative

What it indicates is how the particular cache-line addresses are arranged and searched for to see if an address being accessed is actually in the cache.

Fully Associative Caches

A fully associative cache is one where a cache-line can be placed anywhere within a set. The data will be placed in the cache RAM and, in order to identify where in memory that cache line resides, a ‘tag’ RAM will store the relevant address bits. In a fully associative cache, since this can be in any aligned location in the cache, when checking that an access is to an address in the cache, every location within the tag RAM will need to be checked to see if the access matches an address within one of the store cache-lines.

Functionally, this is the simplest to understand. One has a finite number of cache-lines spaces in a set, and any can be used to store a given cache-line. When it’s full, some mechanism (more later) can be used to release one of the lines currently in use and replace with the new required cache-line. From an implementation point of view, this is an expensive mechanism. Each lookup of the tag RAM requires all entries to be inspected. Normally this would be via a content addressable memory (CAM), sometimes known as an associative memory. These are larger and slower than standard RAMs, but are used in other applications, such as network address lookups and are sometimes employed as caches—but not too often, so I won’t delve further here. Once we have looked at Set Associative caches, it should become obvious how a fully associative cache would work, just with less restrictions.

Set Associative Caches

Set associative caches are not as flexible as fully associative caches, but are much cheaper, faster, and simpler to implement and are, thus, more widespread. I mentioned, in passing, that a cache has two RAMs, the cache RAM itself, and the tag RAM—which would be a CAM for fully associative. The diagram below shows the set up for a single set for a set associative cache.

No alt text provided for this image

The diagram shows an example of a cache which has 32-byte cache-lines and a 16 entry set, giving a 512-byte cache RAM, and a tag ram with 16 entries of tag width bits. This is probably not a practical cache but is useful as an illustration.

In a set associative cache, a cache-line can’t be store anywhere in the cache but is a function of some bits in its address. A cache-line’s address is always aligned with is size, so the bottom bits of a cache-line’s address are always 0. When accessing a particular byte or word within a cache line, the access bottom address bits are an offset within the cache-line. For a 32-byte cache-line, then, the bottom 5 bits are an offset into the cache-line. The next n bits are used to determine which entry in the cache set the cache-line will reside. For a set of 16 entries (as per this example), this is the next 4 bits—bits 8 down to 5. All addresses that have this same value in these ‘index’ bits must uses that entry in the set. The remaining bits are the ‘tag’ and it is these bits that are store in the tag RAM, along with some control and status bits (more later).

When a processor does a load or a store to a particular address, a lookup in the tag RAM will be performed by breaking up the access address bits into the index and TAG. The index will address the entry from the tag RAM and the data read compared with the tag bits of the access address. If the values match—a ‘hit’—then the addressed data is in the cache, and the address of the data in the cache RAM is the index × cache-line size + offset, and the data retrieved for a load, or written for a store. If the values don’t match—a ‘miss’—then the data is not in the cache, and the cache-line will need to be replaced with the data from main memory. The existing data in the particular cache-line may have to be written to main memory first if it has been updated without being written at the point of the miss. Once the cache-line has been replaced, then the comparison will hit, and the data read/written as for a first-time hit.

This description is for a 1 way set associative cache. This will function perfectly well but is limited in that addresses with the same index bits will clash on the same set entry, increasing the likelihood that an entry won’t be in the cache and that an entry will have to be swapped. A solution is to increase the number of sets available—the ‘ways’.

Multi-way Set Associative Caches

If a cache has multiple sets (i.e., the number of ways) then a cache-line address clash is reduced as it can reside in any of the ways available. As mentioned before, typical values for the number of multiple ways is 2 or 4. The diagram below extends the example by increasing the number of ways to 4.

No alt text provided for this image

Now the index bits can index into any of the 4 set entries in the ways. When matching a load or store address, all four tag RAMs are read at the index and all four tags compared. If any of them match, this is a hit, and the ‘way’ that contains the match is flagged to access the data from the cache RAM. The cache RAM might also be four different RAMs, but the way bits (2 bits for a 4-way cache) could just as easily be used as the top address bits in a larger single RAM. The choice will depend on factors like speed and resource requirements. Separate cache RAMs allow pre-fetching of data (on reads) and will be smaller and thus faster but will take up more ASIC real-estate.

If none of the ways contains a match, then this is a miss, but now an added complication arises as to which way to update? In the single set there was only one choice. Now we have to choose between n different ways (4 in the example). Common algorithms employed are

  • Round-robin
  • Least recently used (LRU)

The first is the easiest to understand and implement. For a give index one just loops round the way count each time an entry is updated with a new line: 0, 1, 2, 3, 0, 1…, and so on. This round-robin count, however, must be kept for each line in the set as there is no association between the lines and when they are updated.

In least recently used implementation a tally must be kept of when a particular way was accessed at a given index. When a miss occurs, the entry that was accessed the furthest back in time is chosen to be updated. This sound like it might be complicated, but the number of ways is not likely to be huge (4 is a typical value, as we’ve seen). The way I have approached this before is to assign each entry across the ways a value 0 to n, with 0 being the least recent access and n being the most recent. For a 4-way cache, n is 3. Each time there is a hit on a given index, the matching way has its LRU count updated to 3. The value it had before is used to flag that any way with an LRU value greater than this is decremented. Everything else is left unchanged. For example, if a way is accessed with previous value of 1, then the ways with values 3 and 2 are decremented to become 2 and 1, and the matched way entry set to 3. This maintains the LRU counts across the ways. Note that an LRU count must be kept for each set index, as per the round-robin method.

Cache Control and Status Bits

Along with the address TAG bits, the tag RAM will usually have addition control and status bits. As a minimum each entry will have the following bits

  • Valid
  • Dirty

The valid bit indicates that a valid cache-line is stored at that location. After reset (or a cache invalidation), there are no cache-line in the cache, and the valid bits are reset. When doing an address match, a valid bit that is clear forces a miss.

The dirty bit is used to indicate that a write has been performed at the cache-line entry, but that entry has not been written to main memory. When a cache line needs to be replaced, if the dirty bit is set, then the current cache-line must be written back to main memory before it can be updated. If the entry has never been written, then this write back can be skipped, saving time.

Additional, but optional, bits are often seen. Some 'accessed' bits may be present to store things such as the LRU count bits, used in choosing which line across the ways is to be replaced. Also, it may contain permission bits, indicating read, write, execute, and privilege level permissions. This is only usually done if the system does not have a separate memory protection unit, or an MMU managing these access permissions.

Write Policies

Most commonly, because it’s the best performance, a cache has a write policy of ‘write-back’. That is, it will write-back a cache-line to memory only when it is to be replaced and has the dirty (and valid) bit set.

Another policy used is that of write-through. Here a cache-line is written back to memory every time it is written to from the processor. This does not mean it is invalidated or replaced but means that the line is written. This increases the number of writes to main memory over a write-back policy, making it have a decreased performance. It is used where more than one processing element or other function have access to main memory and need to access data from the processor updating the cache. If a write-back policy, there can be a mismatch of data in a dirty cache-line and main memory for an indefinite amount of time. With write-through the data ends up in main memory after just the cache-line write latency.

Many caches also allow the ability to flush them or to invalidate them—something that an operating system might want to do, rather than a user application. Flushing requires that all dirty entries in the cache be written to main memory. The cache remains valid and can still have hits. If invalidated it is both flushed and all valid bits cleared so that no matches on previous cache-lines are possible.

Cache Coherency

The issue of dirty cache entries and main memory mismatches is a problem known as cache coherency. In a single processor system, when it is the sole accessor of the main memory, this is not an issue. When there’s more than one device in the system with memory access, this might be a problem, if the blocks need to communicate through main memory.

We’ve seen how a write-through policy might solve this issue, but it has a high performance penalty. Also flushing ensures valid main memory but requires co-ordination with operating system software. Other solutions exist to address cache coherency. If you inspect the block diagram of the Xilinx Zynq-7000 from earlier, you will see a snoop control unit (SCU). Connected to it is a bus called ACP—an Accelerator Coherency Port. This allows, via the SCU, to check whether a memory access is in the cache and retrieve the data from there, or whether to access directly from memory. There is a penalty for this cache lookup, but it is smaller than for write-through, where every cache-line is written back whether it needs to be accessed from another source or not. Modern busses and interconnect protocols, such as AXI, support whether a cache coherent access is required or not, giving control to the lookup penalty for only those accesses that require it. Beyond this, for large distributed systems, the Coherent Hub Interface (CHI) interconnect uses message transactions for cache coherent accessing across many caches. These kinds of transactions were employed in supercomputers when I was working in that field and are now making their way into SoCs and chips with high processor counts to solve these types of coherency problems.

As you can see, the use of a cache solves lots of problems in slow memory accesses but create a set of new problems in coherency which must be addressed.

Cache Regions

The assumption so far is that all processor loads and stores are to go through the cache, but not all accesses to the memory map go to main memory. In an SoC, the control and status registers of peripherals may be memory mapped in the same address space. It would not be useful, for instance, to cache accesses to control registers when writing, say, a ‘go’ bit to start a transmission. This is controlled by defining cache regions.

At its simplest a system might mirror an address range, so that a 1 GByte region exists at 0x00000000 to 0x3fffffff and is mirrored at 0x80000000 to 0xbfffffff. The first of these regions might be cached whilst the second region, though accessing the same physical memory, would be un-cached. The top bit of the address defines whether the cache is bypassed or not.

More generally, a cache system may have the ability to define multiple regions of cached and un-cached accesses. These might be separate for instruction and data caches. The LatticeMico32 softcore processor’s cache, for example, has just such definable regions. The diagram below shows some examples of the configurability of cached regions.

No alt text provided for this image

More sophisticated systems may allow more, and discontiguous, regions to be defined.

Conclusions

In this article was discussed how there is a mismatch on processor speed versus accesses to main memory, causing bottlenecks in programs at loads and store, and how caches can help solve some these issues. Caches can take on a couple of different forms of associativity, with advantages and disadvantages of both.

Having looked at some practical cache architectures it was seen that caches create new problems of coherency across a distributed system. This was looked at in terms of write policy and cache coherent accesses via snoop control units and an ACP.

Hopefully this has given you some insight into how caches work, and to terms that you may have heard but were unfamiliar to their meaning. It may be that you will never need to deal with caches directly, or implement new cache systems yourself, but I think understanding them can be crucial when working in systems with these functions. As a software or logic engineer, using the system efficiently with caches and knowing when coherency might be an issue and how to deal with it is useful knowledge.

In the next article we look at virtual memory, memory protection and memory management units (MMUs) that implement this.

Prasanth S.

FPGA Design and Emulation Engineer

2y

Do you write any book?

Like
Reply

To view or add a comment, sign in

More articles by Simon Southwell

  • Instruction Set Simulators, gdb, IDEs and Co-simulation

    Instruction Set Simulators, gdb, IDEs and Co-simulation

    Introduction I have previously discussed instruction set simulators (ISS) in various contexts, including a discussion…

    1 Comment
  • The Python/C Interface

    The Python/C Interface

    Introduction In this article I want to talk about the Python C interface. Now that can be one of two directions, of…

    2 Comments
  • Logic Development and Make

    Logic Development and Make

    Introduction The ‘make’ utility has been around a long time now (since 1976), and my relationship with it isn’t much…

    3 Comments
  • Ethernet and TCP/IP

    Ethernet and TCP/IP

    Introduction I have had a few requests to cover something about ethernet over the last few months and so I’ve finally…

    1 Comment
  • Performance Measurements of VProc on Verilator

    Performance Measurements of VProc on Verilator

    Introduction I have written about the VProc virtual processor before, in terms of what it is, what it is used for, how…

    5 Comments
  • Introduction to USB: Part 5

    Introduction to USB: Part 5

    Introduction In the first 4 articles in this series, we got ourselves to a point just shy of transferring data in USB…

    1 Comment
  • Introduction to USB: Part 4

    Introduction to USB: Part 4

    Introduction In the previous article we looked at the USB 3 physical layer, looking at encoding, scrambling and the…

  • Introduction to USB: Part 3

    Introduction to USB: Part 3

    Introduction In the previous two articles (see here and here) we got up to USB 2.1 with a half-duplex differential pair…

    1 Comment
  • The VProc Virtual Processor VIP

    The VProc Virtual Processor VIP

    Introduction I have written about the VProc virtual processor verification IP in a previous article, which was the…

    1 Comment
  • Introduction to USB: Part 2

    Introduction to USB: Part 2

    Introduction In the first article of this series we got up to the packet level for USB 1.x and 2.

    3 Comments

Insights from the community

Others also viewed

Explore topics