Bits, cycles and polygons. Engine programmer at Team Bondi.
Posts by Luke Hutchinson
  1. The best of the best, or the lowest of the low ? ( Counting comments... )
  2. 3... 2... 1... GO! Roll-Your-Own Countdown Event Synchronization ( Counting comments... )
  3. Memory Address Translation ( Counting comments... )
  4. Getting Down and Dirty with MESI Caches ( Counting comments... )
  5. Old Tricks, New Dogs: Removing Branches ( Counting comments... )
  6. Dual Issue and Software Pipelining on SPUs ( Counting comments... )
  7. Instruction Level Parallelism ( Counting comments... )
Technology/ Code /

How a CPU converts an address specified in a load or store instruction, to a physical address, is not something that we as games programmers typically need to worry about day-to-day. One exception to that would have been back in ye olde days of 32-bit DOS programs running on DOS Protected Mode Interface (DPMI). Still, the details are quite interesting, and can be helpful when trying to understand some performance issues. This post attempts to give a working knowledge of how address translation works on the x86 and Cell architectures, without getting too bogged down in the exact finer details. Also wraps up a couple loose ends in with my previous post on memory caches, "Getting Down and Dirty with MESI Caches" (http://altdev.co/2011/07/09/getting-down-and-dirty-with-mesi-caches/).

CPUs will typically boot up in "real mode", where addresses specified by the running code are treated directly as a real/physical address. The operating system then configures and enables address translation. This allows all kinds of useful things like

  • Protect the OS from malicious applications, by not allowing the applications to see any sensitive memory areas.
  • Application code can be protected from modification, and data from execution.
  • Virtual memory that maps a large virtual memory space onto a smaller physical one.
  • Guard pages to catch stack overflows.
  • Applications can all be linked to a fixed address range, removing the need to relocate them when loading.

x86

The x86 system is conceptually nice and simple, so its a great place to start before we look at the Cell.

In 64-bit mode, the x86 translates "linear addresses" specified by code to physical addresses. A four level hierarchy of data structures is accessed by the CPU. The most significant bits of the linear address index into the Page Map Level 4 (PML4). The PML4 Entry specifies the physical address of a Page Directory Pointer Table (PDPT). The next most significant bits of the linear address index into this PDPT. The other levels of the hierarchy, the Page Directory (PD) and Page Table (PT) are accessed in a similar fashion. The remaining least significant bits of the linear address give an offset into the memory page referenced by the final Page Table Entry (PTE).

The (little endian) bit assignments are

47:39 PML4E index
38:30 PDPTE index
29:21 PDE index
20:12 PTE index
11:0 offset

The layout just described is for 4KB pages (2^num offset bits = 4096). Larger page sizes are supported by dropping the lower levels of the hierarchy and increasing the size of the offset. 2MB pages are stored in a PDE and 1GB pages in a PDPTE.

So what about bits 63:48 ? They should be the same as bit 47, ie the linear address is sign extended to 64-bits. This is known as the canonical form, and you'll get an exception if the most significant bits are anything else.

There are also some legacy addressing modes supported.

Physical Address Extension (PAE) is similar to 64-bit mode, but loses the PML4, and has only a two bit index (31:30) into the PDPT.

32-bit mode loses both the PML4 and PDPT. Also, PDEs and PTEs halve in size from 8-bytes to 4-bytes, though the tables stay the same size (4KB), so an extra bit is used to index each.

31:22 PDE index
21:12 PTE index
11:0 offset

In all paging modes, the CPU has a single global (physical address) pointer to the top level of the hierarchy. This is stored in Control Register 3 (CR3).

When executing in a legacy, non-64 bit mode, another level of address translation is added, segmentation. Instead of code directly specifying a linear address, it instead works with logical addresses. Segmentation converts logical addresses to linear ones, the linear address is then converted to a physical address as described above. Logical addressing is performed by using a segment selector register (CS, DS, ES, SS, FS or GS). The segment selector references an entry in either the Global Descriptor Table (GDT) or Local Descriptor Table (LDT). The segment descriptors in these tables specify a linear base address that is added to the logical offset (as well as offset limits, and some other security info).

Actually, i kinda lied, just so that the segmentation could be ignored to start off with. The FS and GS segment bases do actually work in 64-bit mode, but apart from that, segmentation is not used.

Looking up all these tables for every memory access would be extremely slow, so instead they are cached. Two types of caching can be used. Translation Lookaside Buffers (TLBs) cache information for individual pages. Paging-structure caches can also cache the upper levels of the structure's hierarchy.

Just as the CPUID instruction can be used to determine dcache and icache configurations, it can also be used to find out the TLB setup. For example, my CPU has the following TLB related cache info (from CPUID, EAX=2),

Configuration
Descriptor
Meaning
03 data TLB, 4KB pages, 4-way set associative, 64 entries
5a data TLB, 2MB/4MB pages, 4-way set associative, 32 entries
b2 code TLB, 4KB pages, 4-way set associative, 64 entries
55 code TLB, 2MB/4MB pages, fully associative, 7 entries
ca unified 2nd-level TLB, 4KB pages, 4-way set associative, 512 entries

Notice that small and large pages have separate TLBs.

Caching of segment base and limit values is done using hidden registers that are loaded only when the segment selector changes. For example, a LES instruction to load into the ES segment selector, also performs the GDT/LDT lookup and sets the hidden ESBASE and ESLIMIT registers. There is no specialized memory cache for the GDT and LDT tables as there is the TLB.

All these x86 details are well documented these days (see "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, System Programming Guide, Part 1" http://www.intel.com/products/processor/manuals/). But in the past a lot wasn't documented publicly. Robert Collins wrote an extremely interesting series of articles for Dr Dobb's Journal, in one of these he reverse engineered PAE when it was first introduced in the Pentium Pro (http://www.rcollins.org/ddj/Jul96/Jul96.html).

Cell

IBM uses different terminology from Intel, so lets translate our discussion on translation. Code works in effective addresses (EAs), as opposed to logical addresses. EAs are converted to virtual addresses (VAs) via segmentation. So IBM's virtual addresses are roughly equivalent to Intel's linear addresses. VAs are then converted to real addresses (RAs) via paging. Real addresses are the same as physical addresses.

The segmentation mechanism in the Cell is quite different to the legacy segmentation in x86 PAE and 32-bit modes. The most significant 36-bits of the 64-bit EAs are the effective segment ID (ESID). On some other PowerPC architectures, there is a large set of segment registers that map ESIDs to virtual segment IDs (VSID). The 52-bit VSID is then combined with combined with the lower 28 bits of the EA to form an 80-bit VA.

A Segment Lookaside Buffer (SLB) is used to cache ESID to VSID mappings. The Cell actually does away with segment registers completely and uses just the SLB cache. Without segment registers, the SLB is completely under software control. The OS handles SLB misses and replaces an appropriate SLB entry so that effective to virtual translation can continue.

The Cell contains 64 SLB entries per PPE hardware thread, and 8 per SPE. Yes, each SPE's DMA controller (aka, the memory flow controller (MFC)) use the same memory translation as the PPE, though this is hidden from the SPU code, instead, the PPU code configures each SPE's in built memory management unit (MMU).

Once the SLB has been looked up to form the 80-bit virtual address, this is then converted by a paging mechanism to a 62-bit real address.

We saw that the x86 used a hierarchy of tables to do it's mapping to a physical address, the PowerPC architecture takes an interestingly different approach here. The PowerPC architecture has a flat, hashed page table (HTAB) structure.

Page table entries (PTEs) are organized into groups of 8, called page table entry groups (PTEGs). The HTAB is an array of PTEGs. The most significant bits of the VA form a virtual page number (VPN). The VPN is transformed via two hash functions, a primary and a secondary function. Each hashed value gives the index of a PTEG within the HTAB, so there are 16 possible PTEs that are searched to find a one that matches the VPN. Because of hash collisions, not all valid PTEs can always be stored in the HTAB at the same time. So if the CPU can't find a PTE that matches the VPN, then a page fault exception is triggered so the OS can swap in the correct PTE.

Like the x86 architecture, a TLB is used to cache PTEs. In the case of the Cell, there is a single 1024-entry, 4-way set associative TLB shared by both of the PPE hardware threads. There is also a 256-entry, 4-way set associative TLB per SPE.

The hardware HTAB lookups can actually be disabled, using only the TLB for address translation. Allowing for software control of loading PTEs into the TLB, in a similar manner to the software managed SLB. This is not part of the standard PowerPC architecture, but a Cell specific extension. Not 100% sure, but think the Playstation 3 does use the hardware HTAB.

To speed things up further, the PPE contains another level of caching on top of the SLB and TLB. Effective to real address translation (ERAT) entries directly cache the EA to RA mappings, bypassing the virtual address step completely when there is an ERAT hit.

The PPE contains a data ERAT (D-ERAT) and instruction ERAT (I-ERAT). Both of these are 64-entry, 2-way set associative caches, and are shared by both hardware threads.

ERAT hits are obviously good, but misses suck more than you may first guess. My previous post mentioned how dcache misses don't pay the cost on a load instruction, but rather they stall on the first instruction to use the destination register of the load. Unfortunately a ERAT miss doesn't have this same luxury, and will stall on the load. This is even true for cache prefetches like DCBT, and is one of the reasons why you should always profile prefetching code, and never prefetch NULL. It is not an "error" to prefetch NULL, you wont get an exception, but you will get a D-ERAT miss which can slow things down quite a bit. If a pointer may be NULL, make sure you have a branch around any prefetch, or branch free select another "safe" pointer like the stack.

The "PPC Version 2.02 Book III", "Chapter 4. Storage Control", has a detailed description of the PowerPC architecture if you're interested in more details, such as page sizes and the hash functions (http://www.ibm.com/developerworks/systems/library/es-archguide-v2.html). The "Cell Broadband Engine Programming Handbook", "Chapter 6. Cache Management", discusses Cell specific details (https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/7A77CCDF14FE70D5852575CA0074E8ED).

What Type Of Address Is Used In The Cache?

To tie things all up nicely, going to go back and fill in an interesting detail i purposely avoided in my previous post on caches. There we only mentioned "addresses", and did not distinguish between the different address spaces. On the bus side, cache coherency protocols need to work with real addresses, but from the CPU side, there are a few options.

A simple approach would be convert effective addresses to real addresses before looking up the cache, but this is not necessarily the fastest way to do things. In the Cell PPE's L1 icache and dcache, effective addresses are used to index into the correct cache set. The PPU then needs to check each way of the set to see if there is a real address hit. Each cached line is tagged with the real address so this comparison can be performed. This allows the effective to real translation to be done in parallel to retrieving the line tags. This approach is known as "EA indexing, RA tagging". The L2 is RA indexed and tagged.

EA indexing and RA tagging, does have aliasing issues when multiple EAs map to the same RA, since the same RA could be mapped into two different sets in the cache. Afaict, it is always up to software to either simply not create aliases, make sure aliased EAs map to the same set, or provide explicit synchronization.

Can't find any definitive details on what cache indexing and tagging Intel uses, but the following link, http://software.intel.com/en-us/articles/recap-virtual-memory-and-cache/#TLB_access, seems to indicate they also use linear address indexing and physical address tagging.

Well that concludes a two post whirlwind tour of memory caching. Hope you found it interesting, though guess there is only a handful of people like me who get really excited by this stuff :). Not to completely sell out and try to get blog page hit counts up, but planning some more "practical" topics for the next couple posts.